encoding problem with BeautifulSoup - problem when writing parsed text to file

Sat Oct 8 11:30:23 EDT 2011

On Wed, 05 Oct 2011 21:39:17 -0700, Greg wrote:

> Here is the final code for those who are struggling with similar
> problems:
> 
> ## open and decode file
> # In this case, the encoding comes from the charset argument in a meta
> # tag, e.g. <meta charset="iso-8859-2">
> fileObj = open(filePath,"r").read()
> fileContent = fileObj.decode("iso-8859-2")
> fileSoup = BeautifulSoup(fileContent)

The fileObj.decode() step should be unnecessary, and is usually
undesirable; Beautiful Soup should be doing the decoding itself.

If you actually know the encoding (e.g. from a Content-Type header), you
can specify it via the fromEncoding parameter to the BeautifulSoup
constructor, e.g.:

	fileSoup = BeautifulSoup(open(filePath, "rb").read(), fromEncoding="iso-8859-2")

If you don't specify the encoding, it will be deduced from a meta tag if
one is present, or a Unicode BOM, or using the chardet library if
available, or using built-in heuristics, before finally falling back to
Windows-1252 (which seems to be the preferred encoding of people who don't
understand what an encoding is or why it needs to be specified).
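To make that fallback order concrete, here is a rough standard-library sketch of the same idea. It is not Beautiful Soup's actual code: decode_markup, _BOMS and _META_CHARSET are names made up for illustration, the chardet step is left out, and a plain UTF-8 trial stands in for the "built-in heuristics" (that substitution is an assumption). It is written for Python 3, where the raw input is a bytes object.

```python
import codecs
import re

# Hypothetical helper names; this only illustrates the order of attempts.
# UTF-32 BOMs are listed first because the UTF-32 LE BOM starts with the
# same two bytes as the UTF-16 LE BOM. The codecs chosen here consume the BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Matches <meta charset="..."> as well as the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?([\w.:-]+)""", re.IGNORECASE
)


def decode_markup(raw, known=None):
    """Decode raw bytes; return (text, name of the codec that worked)."""
    candidates = []
    if known:
        # An encoding the caller already knows, e.g. from a Content-Type header.
        candidates.append(known)
    match = _META_CHARSET.search(raw)
    if match:
        candidates.append(match.group(1).decode("ascii"))
    for bom, codec in _BOMS:
        if raw.startswith(bom):
            candidates.append(codec)
            break
    # Assumption: a UTF-8 trial stands in for chardet and other heuristics.
    candidates += ["utf-8", "windows-1252"]
    for codec in candidates:
        try:
            return raw.decode(codec), codec
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or invalid bytes: try the next one
    # Last resort: windows-1252 with undecodable bytes replaced.
    return raw.decode("windows-1252", errors="replace"), "windows-1252"


if __name__ == "__main__":
    page = b'<meta charset="iso-8859-2"><p>\xa3\xf3d\xbc</p>'
    text, codec = decode_markup(page)
    print(codec, text)
```

The point of trying the candidates in this order is that the most trustworthy information (an encoding you were told, then one the document declares) is used first, and the Windows-1252 guess is only reached when everything else has failed.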