[scikit-learn] How to answer questions from big documents?
Rodrigo Rosenfeld Rosas
rr.rosas at gmail.com
Wed Apr 3 14:38:37 EDT 2019
Hi everyone, this is my first post here :)
About two weeks ago, due to low demand in my project, I was assigned a
completely unusual task: automatically extracting answers from documents
using machine learning. I had never read anything about ML, AI or NLP
before, so that is basically what I've been doing for the past two weeks.
When it comes to ML, most book recommendations and tutorials I've found so
far use the Python language and tools, so I took the first week to learn
about Python, NumPy, scikit-learn, pandas, Matplotlib and so on. Then, this week I
started reading about NLP itself, after spending a few days reading about
generic ML algorithms.
So far, I've read about Bag of Words, using TF-IDF (or simple term counts)
to convert words to numeric representations, and a few methods such as
Gaussian and multinomial naive Bayes to train and predict values. The
tutorials also mention the importance of the usual pre-processing steps
such as lemmatization and the like. However, basically all examples assume
that a given text can be classified into one of a fixed set of categories,
as in the sentiment analysis use case. I'm afraid this doesn't represent my
use case, so I'd like to describe it here so that you can help me identify
which methods I should be looking into.
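A minimal sketch of the bag-of-words setup described above, assuming
scikit-learn's TfidfVectorizer and MultinomialNB chained in a pipeline. The
training texts and the two short labels are made up for illustration; they
are not taken from the real database.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical, for illustration only).
texts = [
    "the parties intend to effect a merger of the company",
    "merger of purchaser with and into the company",
    "sellers are willing to sell all of their assets to buyer",
    "buyer wishes to purchase the purchased assets from seller",
]
labels = ["merger", "merger", "asset", "asset"]

# TF-IDF turns each text into a weighted term vector;
# multinomial naive Bayes then learns per-class term distributions.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(texts, labels)

prediction = model.predict(["seller agrees to sell the assets"])[0]
print(prediction)
```

Note that this kind of classifier always assigns one of the known labels to
every text it is given, which is exactly the limitation mentioned above.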
We have a system with thousands of transactions/deals entered manually by
a specialized team. Each deal has a set of documents (typically a dozen per
deal), and some documents can have hundreds of pages. The inputting team
has to extract about a thousand fields from those documents for any
particular deal. So, our database holds all of that data, and we typically
also know the document-specific snippets associated with each field value.
So, my task is, given a new document and deal and based on the previous
answers, to fill in as many fields as I can by automatically finding the
corresponding snippets in the new documents. I'm not sure how I should
approach this problem.
For example, I could treat each sentence of the document as a separate
document to be analyzed and compared to the snippets I already have for the
matching field. However, I can't be sure that the highlighted snippets
cover every sentence that actually answers the question. For example,
there may be 6 occurrences in the documents that would answer a particular
question/field, while the inputters only identified 2 or 3 of them.
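The sentence-by-sentence comparison described above could be sketched as
follows, assuming TF-IDF vectors and cosine similarity as the comparison
measure (one possible choice, not something prescribed here). The regex
sentence splitter is deliberately naive and would break on abbreviations
such as "Inc."; the snippets and the new document are shortened,
hypothetical examples.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Snippets already highlighted for one field value (hypothetical examples).
known_snippets = [
    "Sellers are willing to sell to Buyer all of their assets relating to the Businesses.",
    "Seller wishes to sell and assign to Buyer the rights and obligations "
    "of Seller to the Purchased Assets.",
]

# A new, unlabeled document (hypothetical).
new_document = (
    "This Agreement is governed by the laws of the State of Delaware. "
    "Seller agrees to sell to Buyer all assets of the Business. "
    "Notices shall be delivered in writing to the addresses below."
)

# Naive splitter: break after ., ! or ? followed by whitespace.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", new_document) if s.strip()]

# Fit one vocabulary over snippets and sentences so the vectors are comparable.
vectorizer = TfidfVectorizer(stop_words="english")
vectorizer.fit(known_snippets + sentences)
snippet_vectors = vectorizer.transform(known_snippets)
sentence_vectors = vectorizer.transform(sentences)

# For each sentence, keep its similarity to the closest known snippet.
scores = cosine_similarity(sentence_vectors, snippet_vectors).max(axis=1)
best = int(scores.argmax())

for score, sentence in sorted(zip(scores.tolist(), sentences), reverse=True):
    print(f"{score:.2f}  {sentence}")
```

Ranking by similarity instead of hard classification sidesteps part of the
incomplete-highlighting problem, since unhighlighted sentences are never
used as negative examples.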
Also, any given sentence could indicate that the answer for a given field
is A or B, or it could have absolutely no association with the
field/question, as would be the case for most sentences. I know that
scikit-learn provides the predict_proba method, so I could consider a
sentence relevant only if the probability of it answering the question is
above 80%, for example, but based on a few quick tests I've made with a few
sentences and words, I suspect this won't work very well. Also, it could be
quite slow to treat each sentence of a document with hundreds of pages as a
separate document to be analyzed, so I'm not sure if there are better
methods to handle this use case.
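A minimal sketch of the predict_proba thresholding idea, assuming an extra
"none" class trained on unrelated sentences to stand for "no association".
The "none" label, the helper name relevant_sentences and all training texts
are hypothetical choices made for this illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training sentences, including a catch-all "none" class.
train_texts = [
    "Buyer, Merger LLC, and the Company intend to effect a merger of Merger LLC "
    "with and into the Company.",
    "The Parties wish to effect the acquisition of the Company by Parent through "
    "the merger of Purchaser with and into the Company.",
    "Sellers are willing to sell to Buyer all of their assets relating to the Businesses.",
    "Seller wishes to sell and assign to Buyer the rights and obligations of Seller "
    "to the Purchased Assets.",
    "This Agreement is governed by the laws of the State of Delaware.",
    "Notices shall be delivered in writing to the addresses listed.",
]
train_labels = [
    "Public Target Merger", "Public Target Merger",
    "Asset Purchase", "Asset Purchase",
    "none", "none",
]

model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)


def relevant_sentences(sentences, threshold=0.8):
    """Return (sentence, label, probability) for confident non-"none" predictions."""
    probabilities = model.predict_proba(sentences)
    classes = model.classes_
    results = []
    for sentence, row in zip(sentences, probabilities):
        top = int(np.argmax(row))
        if classes[top] != "none" and row[top] >= threshold:
            results.append((sentence, classes[top], float(row[top])))
    return results


sentences = [
    "Seller will sell the assets to Buyer.",
    "This Agreement is governed by the laws of the State of New York.",
]
for sentence, label, probability in relevant_sentences(sentences, threshold=0.4):
    print(f"{label} ({probability:.2f}): {sentence}")
```

With so little training data the probabilities are poorly calibrated, so
any fixed cut-off (0.4 here, 80% in the text above) would need to be tuned
on held-out deals rather than chosen up front.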
Some of the fields are free-text ones, like company and firm names, for
example, and I suspect those would be the hardest to answer, so I'm trying
to start with the multiple-choice ones, which have a finite set of classes.
How would you advise me to approach this problem? Are there any algorithms
you'd recommend I study for solving this particular problem?
Here is some sample data so that you can get a better understanding of
the problem:
One of the fields is called "Deal Structure" and it could have the
following values: "Asset Purchase", "Stock or Equity Purchase" or "Public
Target Merger" (there are a few others, but this gives you an idea).
So, here are some sentences highlighted for Public Target Merger deals
(these documents come from the public EDGAR filings database, which is
freely available for US deals):
deal 1 / doc 1:
"AGREEMENT AND PLAN OF MERGER, dated as of March 14, 2018
(this “Agreement”), by and among HarborOne Bancorp, Inc., a Massachusetts
corporation (“Buyer”), Massachusetts Acquisitions, LLC, a Maryland limited
liability company of which Buyer is the sole member (“Merger LLC”), and
Coastway Bancorp, Inc., a Maryland corporation (the “Company”)."
"WHEREAS, Buyer, Merger LLC, and the Company intend to effect a merger (the
“Merger”) of Merger LLC with and into the Company in accordance with this
Agreement and the Maryland General Corporation Law (the “MGCL”) and the
Maryland Limited Liability Company Act, as amended (the “MLLCA”), with the
Company to be the surviving entity in the Merger. The Merger will be
followed immediately by a merger of the Company with and into Buyer (the
“Upstream Merger”), with the Buyer to be the surviving entity in the
Upstream Merger. It is intended that the Merger be mutually interdependent
with and a condition precedent to the Upstream Merger and that the Upstream
Merger shall, through the binding commitment evidenced by this Agreement,
be effected immediately following the Effective Time (as defined below)
without further approval, authorization or direction from or by any of the
parties hereto; and"
deal 2 / doc 1:
"WHEREAS, it is also proposed that, as soon as practicable following the
consummation of the Offer, the Parties wish to effect the acquisition of
the Company by Parent through the merger of Purchaser with and into the
Company, with the Company being the surviving entity (the “Merger”);"
Now, for Asset Purchase deals:
deal 3 / doc 1:
"Subject to the terms and conditions of this Agreement, Sellers are willing
to sell to Buyer, and Buyer is willing to purchase from Sellers, all of
their assets relating to the Businesses as set forth herein."
deal 4 / doc 1:
"WHEREAS, Seller wishes to sell and assign to Buyer, and Buyer wishes to
purchase and assume from Seller, the rights and obligations of Seller to
the Purchased Assets (as defined herein), subject to the terms and
conditions set forth herein."
Please forgive me for any imprecise or incorrect terms or misunderstandings
on this topic, as this is all very new to me. Any help is much appreciated.
I've also asked this question on Stack Overflow, so if you'd prefer to
answer there instead, here is the link:
https://stackoverflow.com/questions/55499866/how-to-answer-questions-from-big-documents
Would this field be called data mining? Feature extraction? Question
answering? I'm not sure how to search for this subject properly, so any
hints are very welcome :)
Thanks in advance,
Rodrigo.