encoding problem with BeautifulSoup - problem when writing parsed text to file

Steven D'Aprano steve+comp.lang.python at pearwood.info
Wed Oct 5 23:40:14 EDT 2011


On Wed, 05 Oct 2011 16:35:59 -0700, Greg wrote:

> Hi, I am having some encoding problems when I first parse stuff from a
> non-english website using BeautifulSoup and then write the results to a
> txt file.

If you haven't already read this, you should do so:

http://www.joelonsoftware.com/articles/Unicode.html



> I have the text both as a normal (text) and as a unicode string (utext):
> print repr(text)
> 'Branie zak\xc2\xb3adnik\xc3\xb3w'

This is pretty much meaningless, because we don't know how you got the 
text and what it actually is. You're showing us a bunch of bytes, with no 
clue as to whether they are the right bytes or not. Considering that your 
Unicode text is also incorrect, I would say it is *not* right and your 
description of the problem is 100% backwards: the problem is not 
*writing* the text, but *reading* the bytes and decoding them.
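
For illustration only, here is one way those exact bytes could arise. It is 
a guess, not something your post confirms: it assumes the page was really 
ISO-8859-2 (Windows-1250 gives the same two bytes for these letters), that 
something decoded it as Latin-1, and that the result was then encoded as 
UTF-8. The variable names are made up for the example.

```python
# -*- coding: utf-8 -*-
# ASSUMPTION: the intended text is the Polish phrase below and the page
# was served as ISO-8859-2. Neither is confirmed by the original post.
intended = u"Branie zak\u0142adnik\u00f3w"

page_bytes = intended.encode("iso-8859-2")   # bytes the server might send
wrong_text = page_bytes.decode("latin-1")    # decoded with the wrong codec
double_encoded = wrong_text.encode("utf-8")  # then encoded as UTF-8

# Same byte sequence as shown in the repr() output quoted above.
matches_post = (double_encoded == b"Branie zak\xc2\xb3adnik\xc3\xb3w")

# Undoing the two wrong steps recovers the intended text, but only
# because the guess about the encodings happens to fit.
recovered = (double_encoded.decode("utf-8")
             .encode("latin-1")
             .decode("iso-8859-2"))
```

If the guess is right, the damage happened at decode time, before anything 
was written to a file.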


You should do something like this:

(1) Inspect the web page to find out what encoding is actually used. The 
Content-Type HTTP header and any <meta> charset declaration are the places 
to start.

(2) If the web page doesn't declare what encoding it uses, or if it mixes 
bits and pieces of different encodings, then the source is broken and you 
shouldn't expect much better results. You could try guessing, but you 
should expect mojibake in your results.

http://en.wikipedia.org/wiki/Mojibake

(3) Decode the web page into Unicode text, using the correct encoding.

(4) Do all your processing in Unicode, not bytes.

(5) Encode the text into bytes using UTF-8 encoding.

(6) Write the bytes to a file.
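
A minimal sketch of steps (3) to (6), using only the standard library. The 
byte string stands in for a downloaded page, and the ISO-8859-2 encoding, 
the variable names and the temporary output path are all assumptions made 
for the example, not details taken from your code.

```python
import io
import os
import tempfile

# Stand-in for the raw bytes of a downloaded page.
# ASSUMPTION: the page is encoded as ISO-8859-2.
raw_bytes = b"<p>Branie zak\xb3adnik\xf3w</p>"
declared_encoding = "iso-8859-2"   # what step (1) would have told us

# Step (3): decode the bytes into Unicode text with the correct encoding.
text = raw_bytes.decode(declared_encoding)

# Step (4): do all processing on Unicode text (here, crude tag stripping).
text = text.replace(u"<p>", u"").replace(u"</p>", u"")

# Step (5): encode the text into bytes using UTF-8.
utf8_bytes = text.encode("utf-8")

# Step (6): write the bytes to a file opened in binary mode.
out_path = os.path.join(tempfile.mkdtemp(), "out.txt")
with io.open(out_path, "wb") as out_file:
    out_file.write(utf8_bytes)
```

Opening the file in binary mode keeps steps (5) and (6) separate, so 
nothing else gets a chance to re-encode the data on the way out.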


[...]
> Now I am trying to save this to a file but I never get the encoding
> right. Here is what I tried (+ lot's of different things with encode,
> decode...):

> outFile=codecs.open( filePath, "w", "UTF8" ) 
> outFile.write(utext)
> outFile.close()

That's the correct approach, but it won't help you if utext contains the 
wrong characters in the first place. The critical step is taking the 
bytes in the web page and turning them into text.
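
To make that concrete, here is a small sketch showing that codecs.open 
writes and reads back whatever text it is given, right or wrong. The helper 
name write_and_read_back and the sample strings are hypothetical, and the 
"bad" string assumes an ISO-8859-2 page mis-decoded as Latin-1.

```python
import codecs
import os
import tempfile


def write_and_read_back(utext, path):
    """Write Unicode text as UTF-8, then read it back from the same file."""
    out_file = codecs.open(path, "w", "utf-8")
    out_file.write(utext)
    out_file.close()
    in_file = codecs.open(path, "r", "utf-8")
    try:
        return in_file.read()
    finally:
        in_file.close()


path = os.path.join(tempfile.mkdtemp(), "utext.txt")

good = u"zak\u0142adnik\u00f3w"
# ASSUMPTION: the same word after a wrong decode (ISO-8859-2 as Latin-1).
bad = good.encode("iso-8859-2").decode("latin-1")

good_result = write_and_read_back(good, path)
bad_result = write_and_read_back(bad, path)
```

Both strings survive the round trip unchanged, so the file ends up exactly 
as right or as wrong as utext was before the write.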

How are you generating utext?



-- 
Steven


