[Python-Dev] Encoding detection in the standard library?
wolever at cs.toronto.edu
Tue Apr 22 17:48:07 CEST 2008
On 22-Apr-08, at 12:30 AM, Martin v. Löwis wrote:
>> IMO, encoding estimation is something that many web programs will
>> have to deal with
> Can you please explain why that is? Web programs should not normally
> have the need to detect the encoding; instead, it should be specified
> always - unless you are talking about browsers specifically, which
> need to support web pages that specify the encoding incorrectly.
Two cases come immediately to mind: email and web forms.
When a web browser POSTs data, there is no standard way of
communicating which encoding it's using. There are some hints which
make it easier (accept-charset attributes, the encoding used to send
the page to the browser), but no guarantees.
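To make the web form case concrete, here is a minimal sketch of what such handling tends to look like in practice, assuming the application collects its hints (for example the accept-charset value and the encoding the page was served in) into an ordered list. The function name `decode_form_value` and its parameters are hypothetical, not an existing API.

```python
def decode_form_value(raw, hinted_encodings=("utf-8",)):
    """Decode POSTed bytes by trying each hinted encoding in order.

    Hypothetical helper: returns (text, encoding_used).
    """
    for name in hinted_encodings:
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong hint or unknown encoding name: try the next one.
            continue
    # latin-1 maps every byte to a code point, so this never raises,
    # but the result may be garbage if the real encoding was different.
    return raw.decode("latin-1"), "latin-1"


text, used = decode_form_value(b"caf\xe9", ("utf-8", "latin-1"))
```

This is only trial decoding against known hints, not statistical detection; it cannot tell apart two single-byte encodings that both accept the same bytes.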
Email is a smaller problem, because it usually has a helpful
Content-Type header, but that's no guarantee.
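For the email case, a rough sketch using the standard `email` package (Python 3 API): trust the declared charset when it decodes cleanly, otherwise fall back. The helper name `decode_body` and the latin-1 fallback are assumptions for illustration, and the sketch only handles non-multipart messages.

```python
from email import message_from_bytes


def decode_body(raw_message, fallback="latin-1"):
    """Decode a non-multipart message body; returns (text, encoding_used).

    Hypothetical helper, not part of the email package.
    """
    msg = message_from_bytes(raw_message)
    charset = msg.get_content_charset()     # None if no charset is declared
    payload = msg.get_payload(decode=True)  # bytes, transfer encoding undone
    for name in (charset, fallback):
        if name is None:
            continue
        try:
            return payload.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Header was wrong or names an unknown encoding.
            continue
    # Last resort: latin-1 accepts any byte sequence.
    return payload.decode("latin-1"), "latin-1"
```

When the header is absent or wrong, the fallback is a guess, which is exactly the gap an estimation routine would be meant to fill.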
Now, at the moment, the only data I have to support this claim is my
experience with DrProject in non-English locations.
If I'm the only one who has had these sorts of problems, I'll go back
to "Unicode for Dummies".
>> so it might as well be built in; I would prefer the option
>> to run `text=input.encode('guess')` (or something similar) than rely
>> on an external dependency or, worse yet, a hand-rolled algorithm.
> Ok, let me try differently then. Please feel free to post a patch to
> bugs.python.org, and let other people rip it apart.
> For example, I don't think it should be a codec, as I can't imagine it
> working on streams.
As is frequently the case, it seems like this is a much larger problem
than I originally believed.
I'll go back and take another look at the problem, then come back if
new revelations appear.