[scikit-learn] Mapping fulltext OCR to issue type
Thomas Güttler
guettliml at thomas-guettler.de
Mon Jun 18 06:16:19 EDT 2018
Thank you very much David,
I ordered the book.
Regards,
Thomas
On 13.06.2018 at 12:25, David Asfaha wrote:
>
> Hi,
>
> I would recommend starting with Naive Bayes [1] to classify the issues by parent issue type. To check how well that
> works, learn about F1 scores [2] and use them. If you are happy with the results, and depending on how much data you
> have, try to modify the Naive Bayes classifier to predict the specific issue type. From here there are many more things
> to try, such as an ensemble of classifiers, SVMs, random forests, TF-IDF, n-grams...
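
The baseline described above might look roughly like the following sketch. It assumes scikit-learn is installed. The six training texts and their labels are invented placeholders for the real OCR files, and the second pipeline (TF-IDF over word 1- and 2-grams feeding a linear SVM) is only one of the many possible variations mentioned. Scores on such a tiny sample say nothing about real performance.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented stand-ins for OCR texts, labelled with the top-level (parent) issue type.
train_texts = [
    "power outage in our street since monday",
    "the power grid connection for the new house",
    "blackout again last night no electricity",
    "complaint about the amount on my invoice",
    "please send the invoice with the correct tax rate",
    "the invoice is wrong i want a refund",
]
train_labels = [
    "electricity", "electricity", "electricity",
    "customer support", "customer support", "customer support",
]
test_texts = ["no electricity after the outage", "my invoice shows the wrong tax"]
test_labels = ["electricity", "customer support"]

# Two candidate models: the Naive Bayes baseline and one possible variation.
candidates = {
    "naive bayes, word counts": make_pipeline(CountVectorizer(), MultinomialNB()),
    "linear SVM, TF-IDF 1-2 grams": make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)), LinearSVC()
    ),
}

scores = {}
for name, model in candidates.items():
    model.fit(train_texts, train_labels)
    predicted = model.predict(test_texts)
    # Macro averaging weights every issue type equally, however rare it is.
    scores[name] = f1_score(test_labels, predicted, average="macro")
    print(name, scores[name])
```

With real data, the held-out texts would come from a proper train/test split rather than a hand-written list, and the same loop could be rerun with the full issue type as the label once the parent-level results look acceptable.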
>
> Natural Language Processing with Python is a good book on NLP. Andrew Ng's Machine Learning course on Coursera is
> also worth a look if you're new to the subject.
>
> Hope this helps.
>
> David
>
>
> [1] http://scikit-learn.org/stable/modules/naive_bayes.html
> [2] http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
>
>
> On 13 June 2018 at 10:43, Thomas Güttler <guettliml at thomas-guettler.de> wrote:
>
> I am still willing to learn.
>
> Can anyone recommend a book or website that could help me?
>
> Regards,
> Thomas
>
>
> On 08.06.2018 at 10:48, Thomas Güttler wrote:
>
> We run an issue tracking application. A lot of issues get generated
> from scanned letters.
>
> I have 70k full-text OCR result files. They were created with Tesseract.
>
> Each of these 70k files corresponds to an issue. Each issue has an issue type.
>
> I want to use machine learning so that, in the future, the machine
> can guess the issue type by looking at the full-text OCR.
>
> The issue types are not a flat list; they form a tree.
>
> Example:
>
> electricity / power grid
> electricity / outages
> customer support / invoices / complaint
> customer support / invoices / tax
> ....
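
One way to work with such a tree is to treat each issue type as a path and derive a label for every level from it. The sketch below assumes the types are stored as strings with " / " between levels, exactly as in the example listing; the real storage format is not stated, and the helper names are made up for illustration.

```python
def split_issue_type(path):
    """Split 'a / b / c' into its components, ignoring surrounding spaces."""
    return [part.strip() for part in path.split("/") if part.strip()]


def ancestors(path):
    """Return the type itself followed by each ancestor, most specific first."""
    parts = split_issue_type(path)
    return [" / ".join(parts[:depth]) for depth in range(len(parts), 0, -1)]


def at_depth(path, depth):
    """Cut a type down to at most `depth` levels, e.g. to build parent-level labels."""
    return " / ".join(split_issue_type(path)[:depth])


print(ancestors("customer support / invoices / complaint"))
print(at_depth("customer support / invoices / complaint", 1))
```

Labels produced by `at_depth(..., 1)` could be used to train a coarse classifier first, and the full paths used later for the fine-grained one.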
>
>
> If the machine can't guess
>
> "customer support / invoices / complaint"
>
> it would be nice if it could at least guess roughly the parent issue type:
>
> "customer support / invoices"
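
A possible way to get this fallback is to sum the predicted probabilities of the leaf types up into their parents and then report the deepest node that is confident enough. The sketch below is plain Python and rests on several assumptions: the probabilities are invented (in practice they might come from a classifier's `predict_proba`), the 0.6 threshold is arbitrary, and issue type names are assumed not to contain "/" themselves.

```python
from collections import defaultdict


def aggregate(leaf_probabilities):
    """Sum each leaf probability into every node on its path from the root."""
    totals = defaultdict(float)
    for path, probability in leaf_probabilities.items():
        parts = [part.strip() for part in path.split("/")]
        for depth in range(1, len(parts) + 1):
            totals[" / ".join(parts[:depth])] += probability
    return dict(totals)


def most_specific_confident(leaf_probabilities, threshold=0.6):
    """Return the deepest node whose summed probability reaches the threshold, else None."""
    totals = aggregate(leaf_probabilities)
    confident = [node for node, p in totals.items() if p >= threshold]
    if not confident:
        return None
    # Prefer deeper nodes; among nodes of equal depth, prefer the more probable one.
    return max(confident, key=lambda node: (node.count("/"), totals[node]))


# Invented probabilities: the model cannot decide between the two invoice types.
example = {
    "customer support / invoices / complaint": 0.45,
    "customer support / invoices / tax": 0.40,
    "electricity / outages": 0.10,
    "electricity / power grid": 0.05,
}
print(most_specific_confident(example))  # customer support / invoices
```

Neither leaf reaches the threshold on its own here, but together they make the parent "customer support / invoices" confident, which is the rough guess asked for.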
>
> I have never used scikit-learn before, but I have been using Python for several years.
>
> Could you please point me in the right direction?
>
> Regards,
> Thomas Güttler
>
>
>
> --
> Thomas Guettler http://www.thomas-guettler.de/
> I am looking for feedback: https://github.com/guettli/programming-guidelines
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
--
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines