Python library to break text into words
Abdur-Rahmaan Janhangeer
arj.python at gmail.com
Thu May 31 23:29:27 EDT 2018
1-> search the dictionary and identify every word in the string, for example:
meaningsofoffers
.. identified words:
me
an
mean
in
meaning
meanings
so
of
of
offer
offers
2-> next, filter out duplicates (e.g. "of" above) into a new list, since the
original list serves as the chronological reference
3-> next, choose the words whose lengths add up to the length of the string
4-> if there are several solutions, choose the ones that are non-overlapping
and chronologically sound
5-> unused letters are treated as words, so that non-natural words are
included. that can be problematic if sub-words are found inside them, in
which case point 7 might be the way to go
6-> when non-regular words are included, the program returns the best
solutions for the user to choose from
i have branded the above 6-point algorithm as the Arj.mu Algorithm of Word
Extraction in Connected Letters
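
a rough sketch of points 1 to 4 in Python might look like the code below.
it is only an illustration under stated assumptions, not a finished
library: WORDS is a tiny stand-in for a real dictionary, and the function
names (find_words, unique_words, segmentations) are made up for this
example. points 5 and 6 (unused letters, ranking solutions) are not covered.

```python
# Hypothetical stand-in dictionary; a real word list would be loaded instead.
WORDS = {"me", "an", "mean", "in", "meaning", "meanings",
         "so", "of", "offer", "offers"}


def find_words(text, words=WORDS):
    """Point 1: return every (start, word) pair found in text, by position."""
    found = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in words:
                found.append((start, text[start:end]))
    return found


def unique_words(found):
    """Point 2: drop duplicates into a new list; 'found' is left untouched."""
    return list(dict.fromkeys(word for _, word in found))


def segmentations(text, words=WORDS):
    """Points 3-4: word sequences that cover text exactly, in order,
    without overlapping. Plain recursion, fine for short strings only."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in words:
            for rest in segmentations(text[end:], words):
                results.append([head] + rest)
    return results


if __name__ == "__main__":
    print(unique_words(find_words("meaningsofoffers")))
    print(segmentations("meaningsofoffers"))
```

with this small dictionary, the only full cover of "meaningsofoffers" is
meanings / of / offers; the order in which find_words lists the words is a
choice of this sketch (by start position) and may differ from the listing
above.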
7-> if machine learning is used, point 6 above serves as training (in an
everyday-usage app), or it can train directly on predefined examples
8-> if typos are assumed to be found in titles, then the title should be
assumed to have the corrected words and a new search is done on this
assumed title. in that case the results are added to those of the
non-corrected version and then point 6 above is executed
8.1-> for the assumptions in 8, Natural Language modules might be used
9-> titles can contain numbers, dates, author names and other items, and as
such are not covered by the points above
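
for the typo assumption in point 8, one possible starting point is the
standard library's difflib, used here in place of a full Natural Language
module. the sketch below is an assumption-laden illustration: WORDS is again
a toy dictionary, suggest is a made-up helper name, and the 0.8 cutoff is an
arbitrary choice.

```python
import difflib

# Hypothetical stand-in dictionary; a real word list would be loaded instead.
WORDS = ["me", "an", "mean", "in", "meaning", "meanings",
         "so", "of", "offer", "offers"]


def suggest(chunk, words=WORDS, cutoff=0.8):
    """Return dictionary words that closely resemble a leftover chunk,
    best match first. The cutoff value is an arbitrary assumption."""
    return difflib.get_close_matches(chunk, words, n=3, cutoff=cutoff)


if __name__ == "__main__":
    # "ofers" is a chunk with a dropped letter; suggestions are candidates
    # for the corrected title, which is then searched again as in point 8.
    print(suggest("ofers"))
```

the suggestions could replace the chunk in the assumed title before the
new search is run, with the final choice still left to the user as in
point 6.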
Abdur-Rahmaan Janhangeer
https://github.com/Abdur-rahmaanJ