Hi Ed,

Thanks very much for your reply. I think it helped a lot, but I may be a bit confused about conditional versus unconditional modeling.

What I'm doing is similar to text categorization. I observe some text (vector X) and want to determine if a binary label (scalar y) should be applied (y=1) or not (y=0). So, I look at this as using maxent to estimate P(y|X). In this case, is my sample space simply {0,1} or is it the space from which X is sampled? I had thought this was conditional modelling, but I don't want to explicitly train models for every different X, rather I want to select and weight features from the whole corpus that in turn imply, for each X', some P(y|X'). I assumed it was conditional because I'm not modelling P(y',X').

I have been trying to build models using a sample space with tuples as elements like (X_n,y_n). I then am greedily building a set of feature functions using information gain to select features. It's not quite working, and I'm worried I'm not defining the sample space properly. I will try to figure out how to define the sample space as {0,1} and then redefine the features, but I'd appreciate any advice. Also, if I get this working, I'd be happy to try and help with adding some feature selection examples to your documentation. At the moment, I'm still, as you can tell, figuring out what I'm doing.

Thanks, Matt
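For readers following along, here is a minimal sketch of ranking candidate features by information gain, as Matt describes. It is only an illustration under stated assumptions: the toy corpus, the word-presence features and the helper names (`entropy`, `information_gain`, `word_present`) are hypothetical and do not come from Matt's code or from the maxentropy module.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(feature_values, labels):
    """H(y) - H(y | feature) for one discrete feature."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [y for v, y in zip(feature_values, labels) if v == value]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(labels) - conditional

# Hypothetical toy corpus: each document is a set of words with a binary label.
docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "party"}, {"vote", "match"}]
labels = [1, 1, 0, 0]

def word_present(word):
    """Binary feature: 1 if the word occurs in the document, else 0."""
    return [int(word in d) for d in docs]

gains = {w: information_gain(word_present(w), labels) for w in ["goal", "match", "vote"]}
ranked = sorted(gains, key=gains.get, reverse=True)
```

A greedy selector would repeatedly take the top-ranked candidate; in this toy data "goal" and "vote" separate the classes perfectly while "match" carries no information about the label.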
Hi Matt,

I've just read a paper on maxent text classification (by Nigam, Lafferty and McCallum, 1999) to try to understand how you'd do this. In this paper the model was p(y|x), like you were intending. So the sample space is {0,1}. They express features as functions f_i(x, y) of both the class and the text. Fine ... but now I can understand why you don't want to explicitly train models for each x, which might not even be possible; instead you want a model with one set of features {f_i, i=1,...,m} and one set of corresponding parameters. Hmmm ... so a conditional framework is trickier than I thought ...

And I've now just re-read Malouf's (2002) paper. I think fitting a conditional model just requires proper handling of the normalization constant. So I'll thrash out some code in the next few days and get back to you :)

-- Ed
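The conditional setup Ed describes, with features f_i(x, y), one shared parameter vector and a normalization constant that depends on the context x, can be sketched roughly as follows. This is an illustrative sketch only: `conditional_pmf`, the two indicator features and the weights are hypothetical and are not the interface of the maxentropy module.

```python
import numpy as np

def conditional_pmf(theta, features, x, labels=(0, 1)):
    """Return p(y | x) for every y in `labels` under a log-linear model.

    features: list of functions f_i(x, y) -> float
    theta:    one weight per feature, shared across all contexts x
    """
    # Unnormalized log-scores theta . f(x, y), one per label.
    scores = np.array([
        sum(t * f(x, y) for t, f in zip(theta, features))
        for y in labels
    ])
    # Normalize over the labels only: Z(x) is recomputed for each context x.
    scores -= scores.max()  # subtract the maximum for numerical stability
    p = np.exp(scores)
    return p / p.sum()

# Hypothetical features: word/label co-occurrence indicators.
features = [
    lambda x, y: float("goal" in x and y == 1),
    lambda x, y: float("vote" in x and y == 0),
]
theta = np.array([1.5, 2.0])

p_seen = conditional_pmf(theta, features, {"goal", "match"})
p_unseen = conditional_pmf(theta, features, {"weather"})
```

Because the features are functions of (x, y), the same weights give a distribution over {0,1} for any document, including one that fires no feature at all, in which case the result is uniform.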
_______________________________________________ SciPy-user mailing list SciPy-user@scipy.net http://www.scipy.net/mailman/listinfo/scipy-user
Hi again, Matt ... I hope you don't mind my forwarding this to the list too:
I appreciate your help with this. I'm not sure a ton of stuff will have to be changed. Perhaps the model may need a field for observable data, so the features can map (x,y) to a scalar. I think the probdist/pmf method in your current model class will need to change, maybe to a method for evaluating P(y|X) for a specific document X.
I'm really happy to see this module in Scipy, so if there's anything I can do to help, please let me know.
Now I've been praised in public! I feel warm and fuzzy.

My current thinking is to create a new class, e.g. 'conditionalmodel', that derives from the model class and overrides the necessary methods. I've written most of the code I think is necessary for this, with a couple of example scripts with comments to try to explain what's going on. It's not yet working fully, but I've now made it available anyway in my development branch. You can get it using

svn checkout http://svn.scipy.org/svn/scipy/branches/ejs scipy_ejs

The two example scripts are in the maxentropy/examples/ directory.

Matt, would you like to take it from here? My implementation is based on a paper by Robert Malouf, "A comparison of algorithms for maximum entropy parameter estimation", 2002. He also made the source code available for his implementation, which is now at http://tadm.sourceforge.net/. I've used this for inspiration, and it probably deserves more careful study.

-- Ed
Hi Ed,

Thanks again for working on this. I can try and work on it a bit this weekend. I've had time to look over the two example scripts you provided. There seemed to be some difference in the two in terms of the call to the conditionalmodel fit method. In the low level example, the count parameter seemed to provide the empirical counts of the feature functions, where the features were simply (context,label) co-occurrence. In the high level example, the features are more complicated, and the counts parameter seems to have different dimensionality. I'll try and get a working high level example together next.

/mc
Hi Matt,

I've now found and fixed some bugs in the conditional maxent code. The computation of the conditional expectations was wrong, and the p_tilde parameter was interpreted inconsistently. Both the examples work now! Fantastic!

I'd be very grateful for any assistance you could give in providing more examples -- especially real examples from text classification. The two examples at the moment are too artificial and perhaps a bit confusing. Or if you have any suggestions or patches for simplifying the interface (e.g. the constructor arguments) or any other improvements (e.g. bug fixes, better docs, or a tutorial) I'd also readily merge them.

Let me know how you go with it. When you're happy that it's all working, I'll merge it with the main SVN trunk.

-- Ed
Hi Ed,

I am playing around with the code on some more small examples and everything has been fine. The thing that will hold me back from testing on larger datasets is the F matrix which thus far requires the space of (context,label) pairs to be enumerable. I know that internally you are using a sparse representation for this matrix. Can I initialize the model with a sparse matrix also? This also requires changes with the indices_context parameter in the examples.

I see that you also have an unconditional bigmodel class that seems related, but I'm not sure what would need to be changed.

For a conditional model, computing the feature expectation under the current model still requires knowledge of the training samples. So what I think would make sense is to use two sparse matrices. One matrix needs to represent the training data: our model is for q(x|w), but we still use the empirical p(w) as the prior on the context when computing the feature expectations under the model, so we don't need to consider the whole exponential space of possible contexts. This is shown in Malouf's paper in the equation for the log-likelihood (2) and the second equation in Sec 2.1. Each feature then maps the training data to the corresponding feature output. This requires an N vector per feature, so an N by (#features) sparse matrix could be used for F. Does this make sense?

I should be able to test on some standard datasets if we can figure out how to handle the larger context spaces that come with larger text collections.

Matt
On 21/03/2006, at 9:53 PM, Matthew Cooper wrote:
Hi Ed,
I am playing around with the code on some more small examples and everything has been fine. The thing that will hold me back from testing on larger datasets is the F matrix which thus far requires the space of (context,label) pairs to be enumerable. I know that internally you are using a sparse representation for this matrix. Can I initialize the model with a sparse matrix also? This also requires changes with the indices_context parameter in the examples.
Hi Matt,

Yes, good point. I'd conveniently forgotten about this little problem ;) It turns out scipy's sparse matrices need extending to support this. I've made some changes already (to the ejs branch); the next requirement is more flexible slicing support. I added partial slicing support (for slicing an entire row of a lil_matrix) a couple of months ago, but this isn't good enough here, although it shouldn't be too hard to extend.

One upside of using slicing, rather than fancy indexing as before (which some of scipy's sparse matrix formats do already support), is that the indices_context parameter can then go away completely; we'll just expect the feature indices to be ordered contiguously, which I think is perfectly reasonable here.

I've checked in my latest code (into the ejs branch) in case you want to follow my progress or work on it yourself. But the conditional maxent examples no longer work, so avoid doing 'svn update' if you want to keep the working version for now...
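To illustrate the point about slicing versus fancy indexing, here is a small sketch using present-day scipy.sparse, which supports both on a csr_matrix. The sizes, the random toy matrix and the helper `columns_for_context` are hypothetical; the index list below is only in the spirit of the indices_context parameter, whose exact format in the ejs branch is not shown in this thread.

```python
import numpy as np
from scipy import sparse

num_features, N, K = 3, 4, 2  # hypothetical sizes: features, contexts, labels

# Toy feature matrix whose columns are ordered context by context:
# columns n*K .. (n+1)*K - 1 belong to context n.
rng = np.random.default_rng(0)
dense = rng.integers(0, 2, size=(num_features, N * K)).astype(float)
F = sparse.csr_matrix(dense)

def columns_for_context(F, n, K):
    """Slice out the K columns of F that belong to context n."""
    return F[:, n * K:(n + 1) * K]

# The same block via fancy indexing needs an explicit index list per context.
indices_context = [list(range(n * K, (n + 1) * K)) for n in range(N)]

block_slice = columns_for_context(F, 2, K)
block_fancy = F[:, indices_context[2]]
```

With contiguous ordering the slice bounds follow from n and K alone, so no per-context index list has to be stored or passed around.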
I see that you also have an unconditional bigmodel class that seems related, but I'm not sure what would need to be changed.
Actually, the definition of 'big' here is 'requires Monte Carlo simulation' -- for example, continuous models in many dimensions or models on very large discrete spaces, such as the space of all possible sentences. I'll give some more thought to the rest of your post and get back to you in a few more days... -- Ed
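As a rough illustration of what "requires Monte Carlo simulation" can mean, the sketch below estimates a feature expectation under an unconditional model p(x) proportional to exp(theta * f(x)) by self-normalized importance sampling. This is one common approach, assumed here for illustration; the space size `M`, the feature `f` and the uniform auxiliary distribution are hypothetical and are not taken from the bigmodel class.

```python
import numpy as np

rng = np.random.default_rng(0)

M = 10**6    # hypothetical size of the discrete sample space
theta = 2.0  # hypothetical parameter for the single feature

def f(x):
    """A single toy feature on the space {0, ..., M-1}."""
    return x / M

# Draw from an auxiliary distribution q (uniform here) instead of enumerating.
samples = rng.integers(0, M, size=200_000)
# Importance weights p(x)/q(x) up to a constant: exp(theta * f(x)) for uniform q.
weights = np.exp(theta * f(samples))
estimate = np.sum(weights * f(samples)) / np.sum(weights)

# This toy space is still small enough to enumerate, which allows a sanity check.
xs = np.arange(M)
exact = np.sum(np.exp(theta * f(xs)) * f(xs)) / np.sum(np.exp(theta * f(xs)))
```

On a space such as all possible sentences the exact sum is unavailable, and only the sampled estimate can be computed.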
On 23/03/2006, at 6:29 PM, Ed Schofield wrote:
Hi Matt, Yes, good point. I'd conveniently forgotten about this little problem ;) It turns out scipy's sparse matrices need extending to support this.
And done. I've committed the new sparse matrix features to the ejs branch and fixed conditional maxent models to work with them. The examples seem to work fine too. Please let me know how you go with it! -- Ed
Ed,

I apologize for not looking at this sooner.

I went through the new conditionalexample_high_level.py and I still think there is a small change that needs to be made (I think it's small anyway). I think that we want F to be the size

F = sparse.lil_matrix((len(f), numcorpus*numsamplespace))

where numcorpus = len(corpus)

Basically, the space over which we evaluate each feature to compute expectations under the current model is (X*N), where X is the size of the samplespace (number of classes) and N is the number of labeled training observations. For a feature f_i the expected value under the model is

\langle f_i \rangle_{\theta} = \sum_{n=1}^{N} \frac{1}{N} \sum_{x \in \text{samplespace}} P_{\theta}(x \mid w_n) \, f_i(x, w_n)

so that for each function, we need a lookup table that covers all pairs of x from the samplespace and w_n from the training set. The first sum is simply the empirical context distribution, which is uniform over the training set. The model distribution is only defined conditioning on contexts from the training set. This equation replaces an exponentially large space of contexts with only the N contexts from the training set.

I don't think this alters your code, as long as the pmf and F matrices are initialized correctly.

At test time, we do need to evaluate the feature functions on unseen documents, but this can be handled more easily.

I have another question. I haven't installed your version of scipy outright since it was a bit of a pain to get the current stable distribution up on my machine. However, if I need to load a bunch of modules from your version to test the conditional models is there an easy way to do that? At the moment, I couldn't import sparseutils (I can't find the .py file since I probably haven't built it?).

Thanks, Matt
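Matt's expression for the model expectation, together with an F matrix of shape (len(f), numcorpus*numsamplespace), can be sketched as below. This is a hedged illustration rather than the code in the ejs branch: the column layout (column n*K + k holds f_i(x_k, w_n)), the toy corpus and the function `model_expectations` are all assumptions made for the example.

```python
import numpy as np
from scipy import sparse

def model_expectations(F, theta, N, K):
    """<f_i>_theta = sum_n (1/N) sum_x P_theta(x | w_n) f_i(x, w_n).

    F:     (m, N*K) sparse matrix; column n*K + k holds f_i(x_k, w_n)
    theta: length-m parameter vector
    """
    # Log-scores theta . f(x_k, w_n), reshaped to one row per context.
    scores = np.asarray(F.T @ theta).reshape(N, K)
    scores -= scores.max(axis=1, keepdims=True)
    p = np.exp(scores)
    p /= p.sum(axis=1, keepdims=True)  # P_theta(x | w_n); each row sums to 1
    # Weight each context by the empirical 1/N and flatten back to columns.
    weights = (p / N).ravel()
    return np.asarray(F @ weights).ravel()

# Hypothetical toy data: two training contexts, two classes, two features.
corpus = [{"goal"}, {"vote"}]
samplespace = [0, 1]
f = [
    lambda x, w: float("goal" in w and x == 1),
    lambda x, w: float("vote" in w and x == 0),
]
numcorpus, numsamplespace = len(corpus), len(samplespace)

F = sparse.lil_matrix((len(f), numcorpus * numsamplespace))
for n, w in enumerate(corpus):
    for k, x in enumerate(samplespace):
        for i, f_i in enumerate(f):
            value = f_i(x, w)
            if value != 0.0:
                F[i, n * numsamplespace + k] = value
F = F.tocsr()

uniform = model_expectations(F, np.zeros(2), numcorpus, numsamplespace)
tilted = model_expectations(F, np.array([np.log(3.0), 0.0]), numcorpus, numsamplespace)
```

Only the N training contexts appear as column blocks, so the cost grows with N times the number of classes rather than with the number of possible documents.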
On 31/03/2006, at 3:55 AM, Matthew Cooper wrote:
I went through the new conditionalexample_high_level.py and I still think there is a small change that needs to be made (I think it's small anyway). I think that we want F to be the size
F = sparse.lil_matrix((len(f), numcorpus*numsamplespace))
where numcorpus = len(corpus)
Okay, this seems straightforward. I've changed the example so there are only columns of F for contexts that appear in the corpus.
I don't think this alters your code, as long as the pmf and F matrices are initialized correctly.
Yes, you're right.
At test time, we do need to evaluate the feature functions on unseen documents, but this can be handled more easily.
I'm not sure how yet. I'll give this some thought.
I have another question. I haven't installed your version of scipy outright since it was a bit of a pain to get the current stable distribution up on my machine. However, if I need to load a bunch of modules from your version to test the conditional models is there an easy way to do that?
Which scipy version are you using? If it's recent enough, you can just copy my maxentropy.py and sparse.py files over the installed ones. I'm happy enough that it works now; I've merged the new sparse functionality back into the trunk, and I'll do the same with the conditional maxent class in the next few days.
At the moment, I couldn't import sparseutils (I can't find the .py file since I probably haven't built it?).
sparsetools is written in FORTRAN, with an f2py interface, so it needs to be installed properly by numpy.distutils. But sparsetools is the same in my branch as in the trunk ... -- Ed
Participants (2): Ed Schofield, Matthew Cooper