Chardet oddity

Mark Bourne nntp.mbourne at spamgourmet.com
Wed Oct 23 15:42:00 EDT 2024


Albert-Jan Roskam wrote:
>     Today I used chardet.detect in the repl and it returned windows-1252
>     (incorrect, because it later resulted in a UnicodeDecodeError). When I ran
>     chardet as a script (which uses UniversalDetector) this returned
>     MacRoman. Isn't chardet.detect the correct way? I've used this method many
>     times.
>     # Interpreter
>     >>> contents = open(FILENAME, "rb").read()
>     >>> chardet.detect(content)

Is that copied and pasted from the terminal, or retyped with possible 
transcription errors?  As written, you've assigned the bytes read from 
the file to `contents`, but passed `content` (with no "s") to 
`chardet.detect` - so the result would depend on whatever was previously 
assigned to `content`.
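
One way to take the variable name out of the equation is to keep a single 
bytes object and try each candidate encoding against it. A minimal sketch 
using only the standard library codecs; `sample` and `try_decode` are 
hypothetical names of mine, and the sample bytes are made up rather than 
taken from your file:

```python
# Hypothetical sample (not from the original file): byte 0x8D is undefined
# in Windows-1252 (cp1252) but maps to a c-cedilla in MacRoman.
sample = b"fran\x8dais"


def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if `data` is not valid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Test both guesses against the same bytes object.
results = {enc: try_decode(sample, enc) for enc in ("cp1252", "mac_roman")}
for enc, text in results.items():
    print(enc, "->", "decode failed" if text is None else repr(text))
```

If the real file contains one of the few bytes that cp1252 leaves 
undefined, that alone would explain a UnicodeDecodeError after a 
Windows-1252 guess, while MacRoman would still decode.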

>     {'encoding': 'Windows-1252', 'confidence': 0.7282676610947401, 'language':
>     ''}
>     # Terminal
>     $ python -m chardet FILENAME
>     FILENAME: MacRoman with confidence 0.7167379080370483
>     Thanks!
>     Albert-Jan

-- 
Mark.

