We run an issue tracking application. Many issues are generated from scanned letters.

I have 70k full-text OCR result files, created with Tesseract. Each of these 70k files corresponds to an issue, and each issue has an issue type.

I want to use machine learning so that, in the future, the machine can guess the issue type from the full-text OCR output.

The issue types are not a flat list; they form a tree. Example:

electricity / power grid
electricity / outages
customer support / invoices / complaint
customer support / invoices / tax
....

If the machine can't guess "customer support / invoices / complaint", it would be nice if it could at least guess the parent issue type, "customer support / invoices".

I have never used scikit-learn before, but I have been using Python for several years.

Could you please point me in the right direction?

Regards,
Thomas Güttler

-- 
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines
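One possible way to get the "fall back to the parent" behaviour described above is to treat each issue type as a path and add up the predicted probabilities of the leaf types under every ancestor node. The following is only a sketch under assumptions: the issue types are taken to be " / "-separated strings as in the example, the probabilities are assumed to come from some classifier, and the names `ancestors`, `most_specific_guess` and the `threshold` value are hypothetical.

```python
from collections import defaultdict


def ancestors(issue_type):
    """Return every prefix of the path, from the root down to the full type."""
    parts = [part.strip() for part in issue_type.split("/") if part.strip()]
    return [" / ".join(parts[:depth]) for depth in range(1, len(parts) + 1)]


def most_specific_guess(leaf_probs, threshold=0.6):
    """Pick the deepest node of the tree that the classifier is confident about.

    leaf_probs maps full issue types to predicted probabilities (assumed input).
    Each leaf's probability is added to all of its ancestors, so a parent
    can be confident even when none of its children is.
    Returns None if no node reaches the (arbitrarily chosen) threshold.
    """
    node_mass = defaultdict(float)
    for leaf, prob in leaf_probs.items():
        for node in ancestors(leaf):
            node_mass[node] += prob

    confident = [node for node, mass in node_mass.items() if mass >= threshold]
    if not confident:
        return None
    # Prefer the deepest node; break ties by the larger probability mass.
    return max(confident, key=lambda node: (len(ancestors(node)), node_mass[node]))


# Hypothetical classifier output for one scanned letter.
example_probs = {
    "customer support / invoices / complaint": 0.45,
    "customer support / invoices / tax": 0.40,
    "electricity / outages": 0.15,
}
guess = most_specific_guess(example_probs)
```

Here neither "complaint" nor "tax" is confident on its own, but their shared parent collects the probability of both, so the guess falls back to the parent node. The threshold is a tuning knob, not a recommended value.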
I am still willing to learn.

Does anyone have a recommendation for a book or website that could help me?

Regards,
Thomas

(In reply to Thomas Güttler's message of 08.06.2018, 10:48.)
Hi,

I would recommend starting with Naive Bayes [1] to classify the issues by parent issue type. To check how well that works, learn about F1 scores [2] and use them to evaluate the classifier. If you are happy with the results, and depending on how much data you have, try extending the Naive Bayes classifier to predict the specific issue type.

From there, there are many more things to try, such as an ensemble of classifiers, SVMs, random forests, TF-IDF, n-grams...

"Natural Language Processing with Python" is a good book on NLP. Andrew Ng's Machine Learning course on Coursera is also worth a look if you're new to the subject.

Hope this helps.

David

[1] http://scikit-learn.org/stable/modules/naive_bayes.html
[2] http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.ht...

(In reply to Thomas Güttler's message of 13 June 2018, 10:43.)

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
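The Naive Bayes suggestion above could be started roughly as follows. This is a minimal sketch, not a tuned setup: the texts are made-up stand-ins for the OCR files, "parent issue type" is assumed here to mean the top-level node of the tree, and `top_level` is a hypothetical helper. It uses scikit-learn's `TfidfVectorizer`, `MultinomialNB`, `make_pipeline` and `f1_score`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up stand-ins for the OCR texts and their full issue types.
train_texts = [
    "power outage in our street since monday",
    "the power grid connection for the new house",
    "complaint about wrong amount on invoice",
    "question about tax on my invoice",
]
train_types = [
    "electricity / outages",
    "electricity / power grid",
    "customer support / invoices / complaint",
    "customer support / invoices / tax",
]


def top_level(issue_type):
    """Reduce a full issue type to its top-level node (assumed meaning of 'parent')."""
    return issue_type.split("/")[0].strip()


train_labels = [top_level(issue_type) for issue_type in train_types]

# Bag-of-words features weighted by TF-IDF, fed into multinomial Naive Bayes.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Tiny made-up held-out set; with real data, split the 70k files instead.
test_texts = ["invoice complaint", "power outage again"]
test_labels = ["customer support", "electricity"]

predicted = model.predict(test_texts)
# Macro averaging weights every issue type equally, however rare it is.
macro_f1 = f1_score(test_labels, predicted, average="macro")
```

With the real files, `sklearn.model_selection.train_test_split` can provide the held-out set, and swapping `top_level` for the full issue type turns the same pipeline into a classifier for the specific types.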
Thank you very much, David. I have ordered the book.

Regards,
Thomas

(In reply to David Asfaha's message of 13.06.2018, 12:25.)
participants (2)
- David Asfaha
- Thomas Güttler