[Python-Dev] Automatic encoding detection [was: Re: Python3 "complexity" - 2 use cases]

Steven D'Aprano steve at pearwood.info
Tue Jan 14 03:21:26 CET 2014


On Mon, Jan 13, 2014 at 07:58:43PM -0500, Terry Reedy wrote:

> This discussion strikes me as more appropriate for python-ideas. That 
> said, I am leery of a heuristics module in the stdlib. When is a change 
> a 'bug fix'? and when is it an 'enhancement'?

Depends on the nature of the heuristic. For example, there's a simple 
"guess the encoding of text files" heuristic which uses the presence of 
a BOM to pick the encoding:

- read the first four bytes in binary mode
- if bytes 0 through 3 are 0000FEFF or FFFE0000, then the encoding 
  is UTF-32 (test this first, since FFFE0000 also starts with FFFE);
- if bytes 0 and 1 are FEFF or FFFE, then the encoding is UTF-16;
- if bytes 0 through 2 are EFBBBF, then the encoding is UTF-8;
- if bytes 0 through 2 are 2B2F76 and byte 3 is 38, 39, 2B or 2F, 
  then the encoding is UTF-7;
- otherwise the encoding is unknown.
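
As a rough illustration, the steps above might be sketched in Python 
like this. The function name guess_encoding_from_bom and the returned 
labels are my own choices for the example, not an existing stdlib API. 
The sketch tests the UTF-32 BOMs before the UTF-16 ones, because the 
little-endian UTF-32 BOM (FFFE0000) begins with the little-endian 
UTF-16 BOM (FFFE).

```python
import codecs

# (BOM, label) pairs, tested in order. UTF-32 comes before UTF-16
# because FF FE 00 00 would otherwise match the UTF-16 test first.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32"),   # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32"),   # FF FE 00 00
    (codecs.BOM_UTF16_BE, "utf-16"),   # FE FF
    (codecs.BOM_UTF16_LE, "utf-16"),   # FF FE
    (codecs.BOM_UTF8, "utf-8"),        # EF BB BF
]


def guess_encoding_from_bom(head):
    """Guess an encoding from the first (up to) four bytes of a file.

    Hypothetical helper for illustration; returns None when no BOM
    is recognised.
    """
    for bom, label in _BOMS:
        if head.startswith(bom):
            return label
    # UTF-7: 2B 2F 76 ("+/v") followed by 38, 39, 2B or 2F.
    if head[:3] == b"+/v" and head[3:4] in (b"8", b"9", b"+", b"/"):
        return "utf-7"
    return None
```

A caller would pass in the result of open(path, "rb").read(4). A 
return value of None only means that no BOM was found, not that the 
file is in some particular default encoding.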

Here, telling a bug fix from an enhancement is easy: a bug fix would be 
(say) correcting a BOM that the code got wrong (suppose it tested for 
EFFF instead of FEFF, that would be a bug); an enhancement would be 
adding a new BOM/encoding detector (say, F7644C for UTF-1).

The same would not apply to, for instance, the chardet library, where 
detection is based on statistics. If the library adjusts a frequency 
table, does that reflect a bug or an enhancement or both?



-- 
Steven

