Looking for Python tool to calculate string similarity
Hi folks, I am looking for a tool (or, at worst, an algorithm I can implement myself) which calculates similarities between strings. I would turn this into a pylint plugin because this is how I would consume it in my projects.
My background is that we've identified *duplicate* or *similar* strings in our project which are marked for translation. Some of these differ in upper case vs. lower case and all of the variations in between (I can lower-case everything before sending it to the tool, of course), some differ in spelling, e.g. "test case" vs. "TestCase", and some differ in how certain words or combinations of words are used together in a sentence, e.g. "user does not exist" vs. "the user specified was not found".
Ideally I'd like to consume this tool in CI and, based on the results, reduce the number of source strings needed for translation and make life easier for translators.
Feel free to propose anything, I have not done any research on this topic.
Thanks, Alex
You're looking for https://en.m.wikipedia.org/wiki/Levenshtein_distance. There's a Python module that implements this already (actually an extension, for speed).
_______________________________________________ code-quality mailing list -- code-quality@python.org To unsubscribe send an email to code-quality-leave@python.org https://mail.python.org/mailman3/lists/code-quality.python.org/
I'd guess what you're looking for is Levenshtein distance.
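For readers who have not met Levenshtein distance before, here is a minimal pure-Python sketch of the classic dynamic-programming formulation. The function names `levenshtein` and `similarity` are made up for illustration and are not the API of any module mentioned in this thread; scaling the distance by the longer string's length is just one common convention, assumed here.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the row short
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive score in 0.0..1.0, where 1.0 means identical.

    Hypothetical helper: dividing by the longer length is an assumption,
    not something specified in the thread.
    """
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

With this sketch, "test case" and "TestCase" differ by a single edit once lower-cased. A character-level measure like this is unlikely to catch rephrasings such as "user does not exist" vs. "the user specified was not found", so it probably covers only part of the original question.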
On Sun, Sep 15, 2019 at 11:26:02PM +0300, Alexander Todorov wrote:
Hi folks, I am looking for some tool (or algorithm which I can implement at the worst) which calculates similarities between strings.
There are many such algorithms, starting with the venerable Soundex algorithm, the well-known Levenshtein distance, and many more.
Feel free to propose anything, I have not done any research on this topic.
I propose that next time you do your research first. Here's a starting point for you: https://duckduckgo.com/?q=string+similarity -- Steven
Hey Steve,
Please be respectful to other members in this mailing list.
https://www.python.org/psf/codeofconduct/
What? What's disrespectful about what Steve said? He's right: everyone should spend some of their own time first before asking someone else to invest their (free) time to answer a question that's easily answered via a Google search. I very much hope the CoC is not slapped in devs' faces every time someone doesn't adhere to your personal way of composing emails.
On 9/15/19 5:47 PM, Ashley Whetter wrote:
No matter what's being said, it can always be worded respectfully. If someone came and asked this question in person, would you sarcastically tell them to try googling it? Probably not. I think we're all friendlier than that.
I don't know about others, but when I run into the same situation, instead of saying "just Google it" I provide a few links to get the asker a start. Been doing that, in one form or another, in one medium or another, for my entire 46-year career. Back in the beginning, before the Internet, before ARPAnet, I would provide a few references for the person on a quest to look up in the library. Working on the theory that "give a person a fish, they eat for a day; teach them to fish and they eat forever." If someone can't play nice, they should not play at all. We are all in this together, from retiree to greenhorn.
The scipy package has several implementations of "distances". All of them are useful in one way or another for making these comparisons. Another package (related only to this topic) is https://pypi.org/project/textdistance/ I hope my contribution is not buried under a lot of off-topic replies.
-- Juan B Cabral
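As a dependency-free starting point alongside the packages mentioned above, the standard library's `difflib.SequenceMatcher` can flag near-duplicate strings. The sketch below is only an illustration: `normalize`, `similar_pairs`, and the 0.85 threshold are assumptions made up for this example, and the CamelCase splitting is a guess at what the "test case" vs. "TestCase" example would need.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text: str) -> str:
    """Split CamelCase words, lower-case, and collapse whitespace."""
    # Insert a space wherever a lowercase letter is followed by an uppercase one.
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    return " ".join(text.lower().split())


def similar_pairs(strings, threshold=0.85):
    """Yield (a, b, ratio) for each pair whose normalized ratio >= threshold.

    The default threshold is an arbitrary assumption; tune it on real data.
    """
    for a, b in combinations(strings, 2):
        ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if ratio >= threshold:
            yield a, b, ratio
```

For example, `list(similar_pairs(["test case", "TestCase", "user does not exist"]))` reports only the first two strings as a pair. Comparing every pair is quadratic in the number of strings, which may or may not matter for a CI run depending on the size of the translation catalog.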
On Sun, Sep 15, 2019 at 06:17:04PM -0700, Stephen Satchell wrote:
I don't know about others, but when I run into the same situation, instead of saying "just Google it" I provide a few links to get the asker a start.
Who said "just Google it"?
Back in the beginning, before the Internet, before ARPAnet, I would provide a few references for the person on a quest to look up in the library. Working on the theory that "give a person a fish, they eat for a day; teach them to fish and they eat forever."
Something like mentioning not one but two string similarity algorithms by name, stating that there are many others, and providing a link to get started? What a good idea, if only I had thought of doing something like that... -- Steven
This will be my last comment on this topic. On Sun, Sep 15, 2019 at 05:47:17PM -0700, Ashley Whetter wrote:
No matter what's being said, it can always be worded respectfully. If someone came and asked this question in person, would you sarcastically tell them to try googling it? Probably not.
There was nothing sarcastic in my response. I meant every word of it exactly as I said it. I answered Alexander's question about string similarity, and gave a technically better answer than those who just named Levenshtein distance as if it were the only option. And I didn't *just* answer his question, I gently gave a mildly-worded answer: do your research first next time. I didn't insult him, or call him names, or even call him out for disrespecting everyone's time. It surely took Alexander much longer to type up his email and send it to the list than it would have taken him to google "string similarity". It is not just for our benefit, but his too, that he should do his own basic research before asking questions. -- Steven
Frequently used algorithms for string edit distance:
- Levenshtein & Damerau-Levenshtein distance
- Jaro & Jaro-Winkler distance
- N-gram distance
-- Rain Chen
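To make the n-gram idea concrete, here is a small sketch that compares the sets of character trigrams of two strings using Jaccard similarity. This is only one of several n-gram measures and is assumed here for illustration; the message above does not say which variant was meant. The names `ngrams` and `ngram_similarity` and the padding scheme are likewise made up for this example.

```python
def ngrams(text: str, n: int = 3) -> set:
    """Return the set of character n-grams of text (lower-cased).

    The text is padded with spaces so that leading and trailing
    characters contribute as many n-grams as inner ones.
    """
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the n-gram sets: 0.0 (disjoint) to 1.0 (equal)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 1.0
```

Because it compares sets of fragments rather than aligned characters, a measure like this tends to be more tolerant of reordered words than plain edit distance.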
Thank you to everyone who replied. I will check out all of the suggestions and post after some time with my findings or directly a linter plugin ready for use. -- Alex
participants (10)
- Ahmed Hassan
- Alexander Todorov
- Ashley Whetter
- Carl Crowder
- Juan BC
- Rain Chen
- Sandro Tosi
- Stephen Satchell
- Steven D'Aprano
- Steven D'Aprano