A proof of concept to replace the unreliable Python chardet

There is a very old *issue* regarding encoding detection in text files, one that has only been partially addressed by a program like Chardet <https://github.com/chardet/chardet>. I did not like the idea of a single prober per encoding table, which can lead to hard-coding specifications, so I wanted to challenge the existing methods of discovering the originating encoding.

You could consider this issue obsolete because of current norms: you should declare the charset encoding you use, as described in the standards. The reality is different, though; a huge part of the Internet still serves content with an unknown encoding. (*One could point out SubRip subtitles (SRT), for instance.*) This is why a popular package like Requests <https://github.com/psf/Requests> embeds Chardet to guess the apparent encoding of remote resources.

Charset Normalizer <https://github.com/Ousret/charset_normalizer>
https://pypi.org/project/charset-normalizer

*A first PoC, currently at version 0.3.* The Real First Universal Charset Detector. No C++ bindings. (13-Sept-19)

*LICENSE MIT*
ahmed.tahri@cloudnursery.dev
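To illustrate what "guessing the apparent encoding" means in practice, here is a minimal sketch. The chardet.detect call is the long-standing API that Requests relies on; the charset_normalizer.detect call is the chardet-compatible helper shipped in later releases of the package and is an assumption here, as the 0.3 PoC announced above may have exposed a different interface.

    import chardet
    from charset_normalizer import detect  # assumption: drop-in detect() helper from later releases

    # Raw bytes whose charset was never declared, e.g. a fetched SRT file.
    payload = "Déjà vu : les données étaient mal encodées.".encode("cp1252")

    # chardet returns its best guess plus a confidence score.
    guess = chardet.detect(payload)
    print(guess["encoding"], guess["confidence"])

    # charset-normalizer aims to return a comparable guess without
    # relying on one hard-coded prober per encoding table.
    print(detect(payload)["encoding"])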