[scikit-learn] Mapping fulltext OCR to issue type

Thomas Güttler guettliml at thomas-guettler.de
Mon Jun 18 06:16:19 EDT 2018


Thank you very much, David.

I have ordered the book.

Regards,
   Thomas

On 13.06.2018 at 12:25, David Asfaha wrote:
> 
> Hi,
> 
> I would recommend starting with Naive Bayes [1] to classify the issues by parent issue type. To check how well that
> works, learn about F1 scores [2] and use them. If you are happy with the results, and depending on how much data you
> have, try modifying the Naive Bayes classifier to predict the specific issue type. From there, there are many more
> things to try, such as an ensemble of classifiers, SVMs, random forests, TF-IDF, n-grams...
> 
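A minimal sketch of that first step, assuming the OCR text of each issue is available as a Python string and the issue type as a path like "electricity / outages". The texts and labels here are invented stand-ins, not real data, and TF-IDF with word bigrams is just one reasonable choice of features, not the only one:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy examples standing in for the OCR files and their issue types.
train_texts = [
    "power outage in our street since monday",
    "the power grid connection for the new house",
    "electricity outage after the storm",
    "complaint about the amount on my invoice",
    "question about tax on the last invoice",
    "invoice was sent twice please correct it",
]
train_types = [
    "electricity / outages",
    "electricity / power grid",
    "electricity / outages",
    "customer support / invoices / complaint",
    "customer support / invoices / tax",
    "customer support / invoices / complaint",
]

# Start with the coarse problem: keep only the top-level (parent) issue type.
train_labels = [issue_type.split(" / ")[0] for issue_type in train_types]

# Held-out examples (also invented) for measuring quality.
test_texts = ["outage of the power grid", "wrong tax on invoice"]
test_labels = ["electricity", "customer support"]

# TF-IDF features over words and word bigrams, fed into multinomial Naive Bayes.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(train_texts, train_labels)

predicted = model.predict(test_texts)

# Macro-averaged F1 weights every issue type equally, even the rare ones.
macro_f1 = f1_score(test_labels, predicted, average="macro")
print(list(predicted), macro_f1)
```

With real data, the held-out set would come from splitting the 70k files, and the same pipeline can later be trained on the full issue-type paths instead of the parent types.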
> Natural Language Processing with Python is a good book on NLP. Andrew Ng's Machine Learning course on Coursera is
> also worth a look if you're new to the subject.
> 
> Hope this helps.
> 
> David
> 
> 
> [1] http://scikit-learn.org/stable/modules/naive_bayes.html
> [2] http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
> 
> 
> On 13 June 2018 at 10:43, Thomas Güttler <guettliml at thomas-guettler.de> wrote:
> 
>     I am still willing to learn.
> 
>     Can anyone recommend a book or website that could help me?
> 
>     Regards,
>        Thomas
> 
> 
>     On 08.06.2018 at 10:48, Thomas Güttler wrote:
> 
>         We run an issue tracking application. A lot of issues get generated
>         from scanned letters.
> 
>         I have 70k full-text OCR result files. They were created with Tesseract.
> 
>         Each of these 70k files corresponds to an issue. Each issue has an issue type.
> 
>         I want to use machine learning so that, in the future, the machine
>         can guess the issue type by looking at the full-text OCR.
> 
>         The issue types are not a flat list; they form a tree.
> 
>         Example:
> 
>         electricity / power grid
>         electricity / outages
>         customer support / invoices / complaint
>         customer support / invoices / tax
>         ....
> 
> 
>         If the machine can't guess
> 
>              "customer support / invoices / complaint"
> 
>         it would be nice if it could at least guess roughly the parent issue type:
> 
>              "customer support / invoices"
> 
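One way to get this fallback behaviour, sketched under assumptions: suppose a classifier yields a probability for every leaf issue type (with scikit-learn this could come from `predict_proba` together with `classes_`). The helper `most_specific_guess` and the 0.6 threshold are hypothetical names and values chosen for illustration. It adds each leaf's probability to all of its ancestors and returns the deepest path that is confident enough:

```python
from collections import defaultdict


def most_specific_guess(leaf_probs, threshold=0.6, sep=" / "):
    """Return the deepest issue-type path whose summed probability
    reaches the threshold, or None if no path is confident enough."""
    totals = defaultdict(float)
    for path, prob in leaf_probs.items():
        parts = path.split(sep)
        # Credit the leaf's probability to the leaf and to every ancestor.
        for depth in range(1, len(parts) + 1):
            totals[sep.join(parts[:depth])] += prob

    candidates = [p for p, total in totals.items() if total >= threshold]
    if not candidates:
        return None
    # Prefer deeper (more specific) paths; break ties by higher probability.
    return max(candidates, key=lambda p: (p.count(sep) + 1, totals[p]))


# Invented probabilities: the model cannot decide between the two invoice
# subtypes, but it is fairly sure the letter is about invoices.
unsure_about_leaf = {
    "customer support / invoices / complaint": 0.45,
    "customer support / invoices / tax": 0.40,
    "electricity / outages": 0.10,
    "electricity / power grid": 0.05,
}
fallback = most_specific_guess(unsure_about_leaf)
print(fallback)
```

The threshold trades specificity against reliability: a higher value pushes more guesses up to the parent types, and a suitable value would have to be tuned on held-out data.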
>         I have never used scikit-learn before, but I have been using Python for several years.
> 
>         Could you please point me in the right direction?
> 
>         Regards,
>             Thomas Güttler
> 
> 
> 
>     -- 
>     Thomas Guettler http://www.thomas-guettler.de/
>     I am looking for feedback: https://github.com/guettli/programming-guidelines
>     _______________________________________________
>     scikit-learn mailing list
>     scikit-learn at python.org
>     https://mail.python.org/mailman/listinfo/scikit-learn
> 
> 

-- 
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines

