On Wed, Feb 10, 2010 at 1:03 PM, kj <span dir="ltr">&lt;no.email@please.post&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


<div class="im">>What are y and z?<br>

<br>

</div>  x = "%s %s" % (table['id'], table.tr.renderContents())<br>

<br>

where the variable table represents a BeautifulSoup.Tag instance.<br>

<div class="im"><br>

>Are they unicode or strings?<br>

<br>

</div>The first item (table['id']) is unicode, and the second is str.<br></blockquote><div><br></div><div>The problem is that you are mixing unicode and byte strings; if you want to support unicode at all, you should use it -everywhere- within your program.</div>


<div><br></div><div>At the entry-points, you should convert from byte strings to unicode, then use unicode -exclusively-.</div><div><br></div><div>Here, you are trying to combine a unicode value with a byte string, and Python is -trying- to convert it for you, but it can't. I suspect you haven't realized this implicit conversion is going on: Python is trying to produce the output string you've requested. To do that, it upgrades your format string from str to unicode, then substitutes into it first table['id'] and then table.tr.renderContents(), which it also has to convert to unicode.</div>


<div><br></div><div>It's the table.tr.renderContents() that, I believe, is your problem: it is a byte string, but it contains encoded data. You have to 'upgrade' it to unicode yourself before you can put it into x, because Python doesn't know what encoding it is in. It just knows there's a byte somewhere in there that doesn't fit into ASCII.</div>


<div><br></div><div>I can't quite tell you what the encoding is, though it may be UTF8. Try: table.tr.renderContents().decode("utf8")</div><div><br></div><div>It's an easy mistake to make.</div><div><br></div>
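<div>Applied to your line, that would look roughly like the sketch below. It assumes the bytes really are UTF8, and the two values are made-up stand-ins for table['id'] and table.tr.renderContents() rather than real BeautifulSoup objects; the names table_id and row_bytes are only for illustration.</div>

```python
# Stand-in for table['id'], which is already unicode.
table_id = u"t1"
# Stand-in for table.tr.renderContents(): a byte string holding UTF8 data.
row_bytes = u"caf\u00e9".encode("utf8")

# Decode the byte string first, then format into a unicode format string,
# so no implicit ASCII conversion is ever attempted.
x = u"%s %s" % (table_id, row_bytes.decode("utf8"))
```

<div><br></div>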


<div>To further illustrate, do this:</div><div><br></div><div>>>> "%s %s" % (u'hi', u'bye')</div><div>u'hi bye'</div><div><div><br></div><div>Even though the format string you're building "x" from is a regular string, you're using it to merge in a unicode string. So Python implicitly upgrades your format string, as if you had typed:</div>


<div><br></div><div>>>> u"%s %s" % (u'hi', u'bye')</div><div><br></div><div>Upgrading a string to unicode is easy, provided the string contains nothing but plain ASCII. The moment it contains encoded data, the default translation fails. So you have to explicitly tell Python how to convert the byte string back into unicode; that is, you have to decode it.</div>


<div><br></div><div>Take this example:</div><div><br></div><div><div>>>> s = u"He\u2014llo".encode("utf8")</div><div>>>> s</div><div>'He\xe2\x80\x94llo'</div><div>>>> "%s %s" % (u'hi', s)</div>


<div>Traceback (most recent call last):</div><div>  File "&lt;stdin&gt;", line 1, in &lt;module&gt;</div><div>UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128)</div>


<div><div>>>> "%s %s" % (u'hi', s.decode("utf8"))</div><div>u'hi He\u2014llo'</div><div><br></div><div>I started out with a unicode string containing Hello with an em-dash stuck in the middle, and encoded it as UTF8 to get "s": it's now a byte string similar to what may exist on any webpage you get. Remember, pure "unicode" exists only in memory. Everything on the internet or in a file is /encoded/. You work with unicode in memory, then encode it to some form when writing it out, anywhere.</div>


<div><br></div><div>When I first try to combine the two, I get an error like the one you got.</div><div><br></div><div>But if I explicitly decode my 's', it all works fine. The encoded form is translated back into unicode.</div>
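<div><br></div><div>The other half of that rule, encoding only when the data leaves your program, can be sketched like this. It is a minimal illustration, not anything from your code: io.BytesIO is just a stand-in for a file or socket, and the variable names are invented.</div>

```python
import io

text = u"He\u2014llo"            # unicode, as it lives in memory

# io.BytesIO stands in for a file or socket: it only accepts bytes.
out = io.BytesIO()
out.write(text.encode("utf8"))   # encode at the moment of writing

raw = out.getvalue()             # the bytes a reader on the other side sees
restored = raw.decode("utf8")    # decode again on the way back in
```

<div>Everything between the decode on the way in and the encode on the way out only ever sees unicode.</div>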


<div><br></div><div>As for how to debug and deal with issues like this: never mix unicode and regular strings, use unicode strings exclusively in your logic, and decode as early as possible. Then the only problem you're likely to run into is a claimed encoding/charset that is a lie. (Web servers sometimes claim charset/encoding A, or a file may claim charset/encoding B, even though the file was written out by someone with charset/encoding C, and there's no easy way to fix such situations.)</div>
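<div><br></div><div>One way to decode early while surviving a wrong charset claim is sketched below. The helper decode_input and its fallback policy are my own assumptions for illustration, not a standard recipe: it trusts the claimed encoding first, and if that fails it replaces the undecodable bytes rather than raising.</div>

```python
def decode_input(raw, claimed="utf8"):
    """Decode bytes at an entry point (hypothetical helper).

    Tries the claimed encoding first. If the claim turns out to be wrong,
    undecodable bytes are replaced with U+FFFD instead of raising.
    """
    try:
        return raw.decode(claimed)
    except UnicodeDecodeError:
        return raw.decode(claimed, "replace")

# The claim is true: the bytes really are UTF8.
good = decode_input(u"He\u2014llo".encode("utf8"))

# The claim is a lie: these are Latin-1 bytes labelled as UTF8.
bad = decode_input(u"caf\u00e9".encode("latin-1"))
```

<div>The replacement character loses information, so whether that beats failing loudly depends on what you do with the text afterwards.</div>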


</div></div></div></div><div><br></div>HTH,<br clear="all"><div name="mailplane_signature">--S</div>