[Python-Dev] Encoding detection in the standard library?

"Martin v. Löwis" martin at v.loewis.de
Tue Apr 22 23:16:02 CEST 2008


>> Can you please explain why that is? Web programs should not normally
>> have the need to detect the encoding; instead, it should be specified
>> always - unless you are talking about browsers specifically, which
>> need to support web pages that specify the encoding incorrectly.
> 
> Any program that needs to examine the contents of
> documents/feeds/whatever on the web needs to deal with
> incorrectly-specified encodings

That's not true. Most programs that need to examine the contents of
a web page don't need to guess the encoding. In most such programs,
the encoding can be hard-coded if the declared encoding is not
correct. Most such programs *know* which page they are scraping,
or else they couldn't extract the information they want from it.
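
To make the hard-coding idea concrete, here is a minimal sketch. The
host name, the KNOWN_ENCODINGS table and the decode_page helper are
all made up for illustration; they are not taken from any existing
library.

```python
# Hypothetical table: hosts known to declare the wrong charset,
# mapped to the encoding their pages actually use.
KNOWN_ENCODINGS = {
    "example.org": "cp1252",
}


def decode_page(host, raw, declared):
    """Decode raw bytes, preferring a hard-coded encoding for known hosts.

    Falls back to the declared encoding for every other host.
    """
    encoding = KNOWN_ENCODINGS.get(host, declared)
    return raw.decode(encoding)


# The page claims UTF-8 but is really cp1252; the override wins.
raw = "caf\u00e9".encode("cp1252")
text = decode_page("example.org", raw, declared="utf-8")
```

No guessing is involved: the scraper's author looked at the page
once and wrote down the right answer.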

As for feeds - can you give examples of incorrectly encoded ones?
(I don't ever use feeds, so I honestly don't know whether they
are typically encoded incorrectly. I've heard they are often XML,
in which case I strongly doubt they are incorrectly encoded.)
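
A small sketch of why XML is a different case, using
xml.etree.ElementTree from the standard library. The feed content is
invented for illustration. The parser takes the encoding from the XML
declaration, and a document whose bytes do not match its declaration
is rejected rather than silently misread.

```python
import xml.etree.ElementTree as ET

# Made-up feed that correctly declares ISO-8859-1.
feed = (b'<?xml version="1.0" encoding="iso-8859-1"?>'
        b'<feed><title>caf\xe9</title></feed>')
title = ET.fromstring(feed).find("title").text

# The same bytes with a wrong declaration (UTF-8) are not well-formed,
# so the parser raises instead of returning garbage.
broken = feed.replace(b"iso-8859-1", b"utf-8")
try:
    ET.fromstring(broken)
    rejected = False
except ET.ParseError:
    rejected = True
```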

As for "whatever" - can you give specific examples?

> (which, sadly, is rather common). The
> set of programs that need this functionality is probably the
> same set that needs BeautifulSoup--I think that set is larger than just
> browsers <grin>

Again, can you give *specific* examples that are not web browsers?
Programs needing BeautifulSoup may still not need encoding guessing,
since they still might be able to hard-code the encoding of the web
page they want to process.

In any case, I'm very skeptical that a general "guess encoding"
module would do anything meaningful when applied to incorrectly
encoded HTML pages.

Regards,
Martin

