[scikit-learn] Mapping fulltext OCR to issue type

Fri Jun 8 04:48:59 EDT 2018

We run an issue tracking application. A lot of issues get generated
from scanned letters.

I have 70k full-text OCR result files. They were created with Tesseract.

Each of these 70k files corresponds to an issue. Each issue has an issue type.

I want to use machine learning so that, in the future, the machine
can guess the issue type by looking at the full-text OCR.

The issue types are not a simple list; they form a tree.

Example:

electricity / power grid
electricity / outages
customer support / invoices / complaint
customer support / invoices / tax
....
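
One way to picture this tree in code, as a minimal sketch: treat each issue
type as a path string, so that its parents are simply the prefixes of that
path. The function name `ancestors` and the separator handling are
assumptions made for illustration, not part of the application described here.

```python
def ancestors(issue_type, sep="/"):
    """Return the issue type and all of its parents, most specific first."""
    # Split the path into its levels and drop the padding around each level.
    parts = [p.strip() for p in issue_type.split(sep)]
    # Build every prefix, from the full path down to the top-level type.
    return [" / ".join(parts[:i]) for i in range(len(parts), 0, -1)]


print(ancestors("customer support / invoices / complaint"))
```

With labels stored this way, a guess that is too specific can be relaxed by
moving one step down the returned list.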

If the machine can't guess

    "customer support / invoices / complaint"

it would be nice if it could at least guess roughly the parent issue type:

    "customer support / invoices"
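
A rough sketch of such a fallback, assuming some classifier already yields a
probability per full issue type (for example via a `predict_proba`-style
method): sum the probabilities of all leaves below each node and return the
deepest node whose total reaches a threshold. The function name
`guess_with_fallback`, the threshold of 0.6, and the example probabilities
are all hypothetical.

```python
from collections import defaultdict


def guess_with_fallback(leaf_probs, threshold=0.6, sep=" / "):
    """Pick the deepest issue type whose summed probability reaches threshold.

    leaf_probs maps full issue-type paths to probabilities (hypothetical input).
    Returns None if not even a top-level type is confident enough.
    """
    totals = defaultdict(float)
    for path, prob in leaf_probs.items():
        parts = path.split(sep)
        # Credit the leaf's probability to the leaf itself and every parent.
        for depth in range(1, len(parts) + 1):
            totals[sep.join(parts[:depth])] += prob
    confident = [node for node, total in totals.items() if total >= threshold]
    if not confident:
        return None
    # Deepest node wins; ties on depth go to the higher summed probability.
    return max(confident, key=lambda n: (n.count(sep) + 1, totals[n]))


# Made-up probabilities: neither invoice leaf is confident on its own,
# but together they make the parent type a safe guess.
probs = {
    "customer support / invoices / complaint": 0.45,
    "customer support / invoices / tax": 0.40,
    "electricity / outages": 0.15,
}
print(guess_with_fallback(probs))
```

The threshold trades precision for specificity: a higher value produces more
parent-level guesses, a lower one produces more leaf-level guesses.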

I have never used scikit-learn before, but I have used Python for several years.

Could you please point me in the right direction?

Regards,
   Thomas Güttler

-- 
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines