Mailman 3 [ANN] charset-normalizer 1.3.0 like Chardet, encoding and language detection - Python-announce-list

Oct. 4, 2019

Hi everyone,

git : https://github.com/Ousret/charset_normalizer
pypi : https://pypi.org/project/charset-normalizer/
docs : https://charset-normalizer.readthedocs.io/en/latest/

*As a reminder :*

charset-normalizer is a library that helps you read text from an unknown
character encoding.
The project was motivated by chardet; I am trying to solve the same problem
by taking another approach.
All IANA character set names for which the Python core library provides
codecs are supported.
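
To give a feel for what "codecs provided by the Python core library" covers, here is a minimal sketch that uses only the standard library, not charset-normalizer itself, to list the aliases Python registers for a codec. The helper name `aliases_for` is made up for this illustration.

```python
import codecs
from encodings.aliases import aliases


def aliases_for(name):
    """Return the sorted aliases Python maps to the same codec as `name`.

    Hypothetical helper for illustration; not part of charset-normalizer.
    """
    # Normalise roughly the way the encodings package does.
    key = name.lower().replace("-", "_").replace(" ", "_")
    # The table maps alias -> codec module name; resolve `key` if it is an alias.
    target = aliases.get(key, key)
    # Raises LookupError if Python has no such codec.
    codecs.lookup(target)
    return sorted(alias for alias, codec in aliases.items() if codec == target)


if __name__ == "__main__":
    print(aliases_for("latin-1"))
    print(len(set(aliases.values())), "codec names have at least one alias")
```

The alias table only records alternative spellings, so the exact list depends on the Python version you run.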

*Changes :*

   - *Improvement :* Adjusted frequencies.json for Chinese.
   - *Feature :* Added the possibility to list encoding aliases for a match.
   - *Feature :* Added support for submatches in a match (a list of
   submatches that produce the EXACT same output as the match).
   - *Bugfix :* Sequences shorter than 10 characters were not checked by
   ProbeChaos at all. (#14
   <https://github.com/Ousret/charset_normalizer/pull/14>)
   - *Bugfix :* The legacy detect method inspired by chardet did not return
   the intended result when there was no match. (#14
   <https://github.com/Ousret/charset_normalizer/pull/14>)
   - *Bugfix :* The from_bytes parameters *steps* and *chunk_size* were not
   adapted to the sequence length when the provided values did not fit the
   content, which could lead to misdetection on small content.
   - *Feature :* A plain import charset_normalizer is enough to get
   additional help when you encounter a UnicodeDecodeError exception.
   - *Feature :* You can limit the search to specific encodings with the
   cp_isolation parameter (a list of str), available on from_bytes,
   from_path and from_fp.
   - *Feature :* You can exclude specific encodings from the search with the
   cp_exclusion parameter (a list of str), available on from_bytes,
   from_path and from_fp.
   - *Improvement :* Detection has been improved overall.
   - *Feature :* Added the boolean positional parameter explain to print out
   what actually happens when searching for a match.
   - *Improvement :* best() method of CharsetNormalizerMatches has been
   rewritten for better readability.
   - *Feature :* Added the has_submatch, percent_chaos and percent_coherence
   properties to the single match object.
   - *Improvement :* Added aliases for the CharsetNormalizerMatches class:
   CharsetDetector, EncodingDetector and CharsetDoctor.
   - *Feature :* Added preemptive behaviour: the detector looks for a
   declared encoding in the content, controlled by the positional parameter
   preemptive_behaviour (defaults to True). A declared encoding is not taken
   for granted; it is simply tested first.
   - *Feature :* The unicodedata2 backport is now supported. To benefit from
   it, install with pip install charset-normalizer[UnicodeDataBackport].
   Python 3.7 ships with UnicodeData v11; the backport lets you upgrade to
   v12.
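
To illustrate how cp_isolation, cp_exclusion and preemptive_behaviour could interact, here is a minimal standard-library sketch. It is not the library's actual implementation: the three parameter names come from the changelog above, while the functions `declared_encoding` and `candidates`, the regular expression and the `search_zone` argument are assumptions made for this example.

```python
import codecs
import re

# Assumed pattern for a declaration such as `charset=utf-8`,
# `encoding="latin-1"` or `coding: cp1252`; the library's own rule may differ.
DECLARATION = re.compile(
    rb"(?:charset|encoding|coding)[:=\s]{1,10}[\"']?([A-Za-z0-9_\-]+)",
    re.IGNORECASE,
)


def _norm(name):
    """Map any spelling of a codec name to Python's canonical name."""
    return codecs.lookup(name).name


def declared_encoding(data, search_zone=4096):
    """Return the canonical name of an encoding declared near the start, or None."""
    match = DECLARATION.search(data[:search_zone])
    if match is None:
        return None
    try:
        return _norm(match.group(1).decode("ascii"))
    except LookupError:
        return None  # The declared name is not a codec Python knows.


def candidates(data, pool, cp_isolation=None, cp_exclusion=None,
               preemptive_behaviour=True):
    """Order the encodings worth trying for `data` (hypothetical helper)."""
    selected = [_norm(name) for name in pool]
    if cp_isolation:
        allowed = {_norm(name) for name in cp_isolation}
        selected = [name for name in selected if name in allowed]
    if cp_exclusion:
        banned = {_norm(name) for name in cp_exclusion}
        selected = [name for name in selected if name not in banned]
    if preemptive_behaviour:
        declared = declared_encoding(data)
        if declared in selected:
            # Tried first, but still tried: the declaration is only a hint.
            selected.remove(declared)
            selected.insert(0, declared)
    return selected


if __name__ == "__main__":
    xml = b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'
    print(candidates(xml, ["utf-8", "cp1252", "latin-1"]))
    print(candidates(xml, ["utf-8", "cp1252", "latin-1"], cp_exclusion=["latin1"]))
```

In this sketch an excluded encoding stays excluded even when the content declares it; whether the library resolves that conflict the same way is not stated in the changelog.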

| Feature | Chardet <https://github.com/chardet/chardet> | Charset Normalizer | cChardet <https://github.com/PyYoshi/cChardet> |
|---|---|---|---|
| Fast | ❌ | ✅ | ✅ ⚡ |
| Universal** | ❌ | ✅ | ❌ |
| Reliable *without* distinguishable standards | ❌ | ✅ | ✅ |
| Reliable *with* distinguishable standards | ✅ | ✅ | ✅ |
| Free & Open | ✅ | ✅ | ✅ |
| Native Python | ✅ | ✅ | ❌ |
| Detect spoken language | ❌ | ✅ | N/A |
| Supported Encoding | 30 | 🎉 90 <https://charset-normalizer.readthedocs.io/en/latest/support.html> | 40 |

Ahmed TAHRI
