[scikit-learn] Mapping fulltext OCR to issue type

Thomas Güttler guettliml at thomas-guettler.de
Wed Jun 13 05:43:55 EDT 2018


I am still willing to learn.

Can anyone recommend a book or website that could help me?

Regards,
   Thomas

On 08.06.2018 at 10:48, Thomas Güttler wrote:
> We run an issue tracking application. A lot of issues get generated
> from scanned letters.
> 
> I have 70k full text OCR result files. They were created with Tesseract.
> 
> Each of these 70k files corresponds to an issue. Each issue has an issue type.
> 
> I want to use machine learning so that, in the future, the machine
> can guess the issue type by looking at the full text OCR.
> 
> The issue types are not a flat list; they form a tree.
> 
> Example:
> 
> electricity / power grid
> electricity / outages
> customer support / invoices / complaint
> customer support / invoices / tax
> ....
> 
> 
> If the machine can't guess
> 
>     "customer support / invoices / complaint"
> 
> it would be nice if it could at least guess roughly the parent issue type:
> 
>     "customer support / invoices"
> 
> I have never used scikit-learn before, but I have been using Python for several years.
> 
> Could you please point me in the right direction?
> 
> Regards,
>    Thomas Güttler
> 
> 

-- 
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines

