[ANN] charset-normalizer 1.3.0 like Chardet, encoding and language detection

Hi everyone,

git : https://github.com/Ousret/charset_normalizer
pypi : https://pypi.org/project/charset-normalizer/
docs : https://charset-normalizer.readthedocs.io/en/latest/

*As a reminder :* charset-normalizer is a library that helps you read text from an unknown charset encoding. The project was motivated by chardet; I'm trying to resolve the same issue by taking another approach. All IANA character set names for which the Python core library provides codecs are supported.

*Changes :*

- *Improvement :* Adjustment in frequencies.json for Chinese.
- *Feature :* Added the possibility to list encoding aliases for a match.
- *Feature :* Support for submatches in a match (list of submatches that produce the EXACT same output as the match).
- *Bugfix :* Sequences shorter than 10 chars were not checked by ProbeChaos at all. (#14 <https://github.com/Ousret/charset_normalizer/pull/14>)
- *Bugfix :* The legacy detect method inspired by chardet was not returning the intended result when there was no match. (#14 <https://github.com/Ousret/charset_normalizer/pull/14>)
- *Bugfix :* The from_bytes parameters *steps* and *chunk_size* were not adapted to the sequence length when the provided values did not fit the content, which could lead to misdetection on small content.
- *Feature :* import charset_normalizer is enough to get additional help when you encounter a UnicodeDecodeError exception.
- *Feature :* You can limit the search to some encodings when looking for a match with the parameter cp_isolation (list of str), available for from_bytes, from_path and from_fp.
- *Feature :* You can exclude some encodings when searching for a match with the parameter cp_exclusion (list of str), available for from_bytes, from_path and from_fp.
- *Improvement :* Detection has been improved overall.
- *Feature :* Added the explain boolean positional parameter to print out what actually happens when searching for a match.
- *Improvement :* The best() method of CharsetNormalizerMatches has been rewritten for better readability.
- *Feature :* Added has_submatch, percent_chaos and percent_coherence properties on a single match object.
- *Improvement :* Added aliases to the CharsetNormalizerMatches class: CharsetDetector, EncodingDetector and CharsetDoctor.
- *Feature :* Added preemptive behaviour: looks for a declared encoding, using the positional parameter preemptive_behaviour (defaults to True). The declared encoding is not taken for granted; it is only tested first.
- *Feature :* Now supports the unicodedata2 backport. To benefit from it, install using pip install charset-normalizer[UnicodeDataBackport]. Python 3.7 has UnicodeData v11; you could upgrade it to v12.

| Feature | Chardet <https://github.com/chardet/chardet> | Charset Normalizer | cChardet <https://github.com/PyYoshi/cChardet> |
| Fast | ❌ | ✅ | ✅ ⚡ |
| Universal** | ❌ | ✅ | ❌ |
| Reliable *without* distinguishable standards | ❌ | ✅ | ✅ |
| Reliable *with* distinguishable standards | ✅ | ✅ | ✅ |
| Free & Open | ✅ | ✅ | ✅ |
| Native Python | ✅ | ✅ | ❌ |
| Detect spoken language | ❌ | ✅ | N/A |
| Supported Encoding | 30 | 🎉 90 <https://charset-normalizer.readthedocs.io/en/latest/support.html> | 40 |
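
To make the cp_isolation, cp_exclusion and encoding-alias items from the changes more concrete, here is a minimal sketch that uses only the Python standard library. It is not charset-normalizer's code and does not call its API: the helper names (canonical, decodable_with, encoding_aliases) and the fixed candidate list are hypothetical, and the filtering logic is an assumption about what "limit" and "exclude" mean. The sketch only checks which candidates decode without error; it does no chaos or coherence scoring.

```python
import codecs
from encodings.aliases import aliases


def canonical(name):
    """Return the name Python's codec registry uses for an encoding."""
    return codecs.lookup(name).name


def decodable_with(payload, candidates, cp_isolation=None, cp_exclusion=None):
    """List the candidate encodings that decode payload without error.

    cp_isolation / cp_exclusion borrow the parameter names from the
    changelog; the filtering below is an assumption for illustration.
    """
    keep = {canonical(n) for n in cp_isolation} if cp_isolation else None
    drop = {canonical(n) for n in cp_exclusion} if cp_exclusion else set()
    matches = []
    for name in candidates:
        codec = canonical(name)
        if keep is not None and codec not in keep:
            continue  # not in the allowed set
        if codec in drop:
            continue  # explicitly excluded
        try:
            payload.decode(codec)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        matches.append(codec)
    return matches


def encoding_aliases(name):
    """Collect the stdlib aliases that resolve to the same codec as name."""
    target = canonical(name)
    found = []
    for alias, codec in aliases.items():
        try:
            if canonical(codec) == target:
                found.append(alias)
        except LookupError:
            continue  # alias entry without a codec on this platform
    return sorted(found)


# Hypothetical candidate list and sample bytes for the demonstration.
CANDIDATES = ["ascii", "utf-8", "cp1252", "latin-1"]
payload = "Olá, João".encode("cp1252")

print(decodable_with(payload, CANDIDATES))
print(decodable_with(payload, CANDIDATES, cp_exclusion=["latin_1"]))
print(decodable_with(payload, CANDIDATES, cp_isolation=["utf_8", "ascii"]))
print(encoding_aliases("utf_8"))
```

Names are normalized through the codec registry before filtering, so "latin_1" and "latin-1" are treated as the same encoding. Several single-byte encodings accept the same bytes, which is why successful decoding alone cannot pick a winner and a real detector has to rank the surviving candidates.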
Ahmed TAHRI