A proof of concept to replace the unreliable Python chardet

There is a very old *issue* regarding encoding detection in text files, one that has only been partially addressed by a program like Chardet <https://github.com/chardet/chardet>. I did not like the idea of a single prober per encoding table, which can lead to hard-coding specifications, so I wanted to challenge the existing methods of discovering the originating encoding.

You could consider this issue obsolete because of current norms: you should declare the charset encoding you use, as described in the standards. The reality is different, though; a huge part of the Internet still serves content with an unknown encoding. (*One could point out SubRip subtitles (SRT), for instance.*) This is why a popular package like Requests <https://github.com/psf/Requests> embeds Chardet to guess the apparent encoding of remote resources.

Charset Normalizer <https://github.com/Ousret/charset_normalizer>
https://pypi.org/project/charset-normalizer

*A first PoC, currently at version 0.3.* The Real First Universal Charset Detector. No C++ bindings. (13-Sept-19)

*LICENSE MIT*
ahmed.tahri@cloudnursery.dev
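To illustrate what "guessing the apparent encoding" means in practice, here is a minimal sketch. The chardet.detect call is the long-standing API that Requests relies on; the charset_normalizer.detect call is the chardet-compatible helper shipped in later releases of the package and is an assumption here, as the 0.3 PoC announced above may have exposed a different interface.

    import chardet
    from charset_normalizer import detect  # assumption: drop-in detect() helper from later releases

    # Raw bytes whose charset was never declared, e.g. a fetched SRT file.
    payload = "Déjà vu : les données étaient mal encodées.".encode("cp1252")

    # chardet returns its best guess plus a confidence score.
    guess = chardet.detect(payload)
    print(guess["encoding"], guess["confidence"])

    # charset-normalizer aims to return a comparable guess without
    # relying on one hard-coded prober per encoding table.
    print(detect(payload)["encoding"])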