[Python-Dev] Encoding detection in the standard library?

Tue Apr 22 17:48:07 CEST 2008

On 22-Apr-08, at 12:30 AM, Martin v. Löwis wrote:
>> IMO, encoding estimation is something that many web programs will have
>> to deal with
> Can you please explain why that is? Web programs should not normally
> have the need to detect the encoding; instead, it should be specified
> always - unless you are talking about browsers specifically, which
> need to support web pages that specify the encoding incorrectly.
Two cases come immediately to mind: email and web forms.
When a web browser POSTs data, there is no standard way of  
communicating which encoding it's using.  There are some hints which  
make it easier (accept-charset attributes, the encoding used to send  
the page to the browser), but no guarantees.
Email is a smaller problem, because it usually has a helpful Content-Type
header, but that header is not guaranteed to be present or accurate.
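
To make the web form and email cases concrete, here is a minimal sketch of the
kind of hand-rolled fallback that programs end up writing today. It is not a
real detector. The function name `decode_with_fallback`, the `declared`
parameter (standing in for whatever charset hint the request or message
carried), and the candidate list are all assumptions made for illustration,
and the code is written for Python 3.

```python
def decode_with_fallback(data, declared=None, candidates=("utf-8", "cp1252")):
    """Return (text, encoding_used) for bytes of uncertain encoding.

    Hypothetical helper: tries the declared charset first (if any), then a
    fixed list of candidates, and finally latin-1 as a last resort.
    """
    tried = []
    if declared:
        tried.append(declared)
    tried.extend(name for name in candidates if name not in tried)

    for name in tried:
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or a charset name Python does not know: keep going.
            continue

    # latin-1 maps every byte to a code point, so this step cannot fail,
    # although the resulting text may well be wrong.
    return data.decode("latin-1"), "latin-1"
```

Note that a successful decode only shows the bytes are valid in that encoding,
not that the guess is right, which is exactly why a proper estimator would be
nicer than this.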

Now, at the moment, the only data I have to support this claim is my  
experience with DrProject in non-English locations.
If I'm the only one who has had these sorts of problems, I'll go back  
to "Unicode for Dummies".

>> so it might as well be built in; I would prefer the option
>> to run `text=input.encode('guess')` (or something similar) than relying
>> on an external dependency or worse yet using a hand-rolled algorithm.
> Ok, let me try differently then. Please feel free to post a patch to
> bugs.python.org, and let other people rip it apart.
> For example, I don't think it should be a codec, as I can't imagine it
> working on streams.
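
A small sketch of the streaming concern, as I understand it (Python 3; the
sample payload and the chunk boundaries are made up for illustration). A codec
for a known encoding can decode chunk by chunk, but a guess has to commit
before it has seen the bytes that would settle the question.

```python
import codecs

# Made-up payload: an ASCII prefix followed by one non-ASCII character.
payload = "Subject: caf\u00e9".encode("utf-8")
# Split inside the two-byte UTF-8 sequence for the final character.
chunks = [payload[:13], payload[13:]]

# With a known encoding, the incremental decoder carries the dangling byte
# over from one chunk to the next.
decoder = codecs.getincrementaldecoder("utf-8")()
text = "".join(decoder.decode(chunk) for chunk in chunks)
text += decoder.decode(b"", final=True)

# A guesser that has only seen the ASCII prefix cannot tell these encodings
# apart: they all decode it to the same string.
prefix = payload[:8]
decodings = {prefix.decode(name) for name in ("ascii", "utf-8", "cp1252", "latin-1")}
prefix_is_ambiguous = len(decodings) == 1
```

So a guessing "codec" would either have to buffer the whole stream or risk
changing its mind after it has already emitted text.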

As is often the case, this seems to be a much larger problem than I
originally believed.

I'll go back and take another look at the problem, then come back if  
new revelations appear.