[Python-Dev] Encoding detection in the standard library?

Mike Klaas mike.klaas at gmail.com
Wed Apr 23 01:10:17 CEST 2008


On 22-Apr-08, at 2:16 PM, Martin v. Löwis wrote:
>
>> Any program that needs to examine the contents of
>> documents/feeds/whatever on the web needs to deal with
>> incorrectly-specified encodings
>
> That's not true. Most programs that need to examine the contents of
> a web page don't need to guess the encoding. In most such programs,
> the encoding can be hard-coded if the declared encoding is not
> correct. Most such programs *know* what page they are webscraping,
> or else they couldn't extract the information out of it that they
> want to get at.

I certainly agree that if the target set of documents is small enough
it is possible to hand-code the encoding.  There are many
applications, however, that need to examine the content of an
arbitrary, or at least large, set of web documents.  To name a few
such applications:

  - web search engines
  - translation software
  - document/bookmark management systems
  - other kinds of document analysis (market research, SEO, etc.)

> As for feeds - can you give examples of incorrectly encoded ones?
> (I don't ever use feeds, so I honestly don't know whether they
> are typically encoded incorrectly. I've heard they are often XML,
> in which case I strongly doubt they are incorrectly encoded.)

I also don't have much experience with feeds.  My statement is based  
on the fact that chardet, the tool that has been cited most in this  
thread, was written specifically for use with the author's feed  
parsing package.
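
For anyone who hasn't used it, chardet's interface is tiny.  A
minimal sketch of the guessing step (the confidence threshold and
the Latin-1 fallback here are my own choices, not anything chardet
prescribes):

    # Guess the encoding of raw bytes with the third-party chardet
    # package, falling back to Latin-1 when the guess looks weak.
    import chardet

    def to_unicode(raw_bytes):
        guess = chardet.detect(raw_bytes)  # {'encoding': ..., 'confidence': ...}
        if guess['encoding'] and guess['confidence'] > 0.5:
            return raw_bytes.decode(guess['encoding'], 'replace')
        return raw_bytes.decode('latin-1', 'replace')  # Latin-1 always decodes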

> As for "whatever" - can you give specific examples?

Not that I can substantiate.  Documents and feeds cover most of what
is on the web--I was only trying to make the point that whenever an
encoding can be specified on the web, it will be specified
incorrectly for a significant fraction of documents.

>> (which, sadly, is rather common). The set of programs that need
>> this functionality is probably the same set that needs
>> BeautifulSoup--I think that set is larger than just browsers <grin>
>
> Again, can you give *specific* examples that are not web browsers?
> Programs needing BeautifulSoup may still not need encoding guessing,
> since they still might be able to hard-code the encoding of the web
> page they want to process.

Indeed, if it is only one site it is pretty easy to work around.  My
main use of Python is processing and analyzing hundreds of millions
of web documents, so the applications are easy for me to see (I have
listed some above).  I think that libraries like Mark Pilgrim's
FeedParser and BeautifulSoup are possible consumers of encoding
guessing as well.
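
In fact BeautifulSoup already wraps this kind of guessing in its
UnicodeDammit class; a short sketch of how a consumer uses it (the
import path assumes the bs4 package, and the candidate encodings are
just an example--older BeautifulSoup releases expose UnicodeDammit
from the BeautifulSoup module instead):

    # Let UnicodeDammit pick an encoding, trying any encodings the
    # caller already suspects before falling back to detection.
    from bs4 import UnicodeDammit

    raw = open('page.html', 'rb').read()
    dammit = UnicodeDammit(raw, ['utf-8', 'iso-8859-1'])
    text = dammit.unicode_markup        # document decoded to unicode
    used = dammit.original_encoding     # encoding it settled on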

> In any case, I'm very skeptical that a general "guess encoding"
> module would do a meaningful thing when applied to incorrectly
> encoded HTML pages.

Well, it does.  I wish I could easily provide data on how often it is
necessary over the whole web, but that would be somewhat difficult to
generate.  I can say that it is much more important to be able to
parse all the different kinds of encoding _specification_ found on
the web (Content-Type and Content-Encoding headers, <meta http-equiv>
tags, etc.), including the malformed cases of these.
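
To give a flavour of what "parsing the specification" means, here is
a rough sketch (the regex and the 4 KB sniff window are my own
assumptions; real documents are considerably messier than this):

    # Pull a declared charset out of an HTTP Content-Type header or an
    # HTML <meta http-equiv> tag, tolerating sloppy quoting and spacing.
    import codecs
    import re

    CHARSET_RE = re.compile(r'charset\s*=\s*["\']?\s*([-\w.:]+)', re.IGNORECASE)

    def declared_encoding(content_type_header, html_bytes):
        candidates = (content_type_header or '',
                      html_bytes[:4096].decode('ascii', 'replace'))
        for source in candidates:
            match = CHARSET_RE.search(source)
            if match:
                name = match.group(1)
                try:
                    codecs.lookup(name)      # skip names Python can't use
                    return name
                except LookupError:
                    continue
        return None                          # caller falls back to guessing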

I can also think of good arguments for excluding encoding detection  
for maintenance reasons: is every case of the algorithm guessing wrong  
a bug that needs to be fixed in the stdlib?  That is an unbounded  
commitment.

-Mike

