usage of <string>.encode('utf-8','xmlcharrefreplace')?
bbxx789_05ss at yahoo.com
Tue Feb 19 07:38:24 CET 2008
On Feb 18, 10:52 pm, "Carsten Haese" <cars... at uniqsys.com> wrote:
> On Mon, 18 Feb 2008 21:36:17 -0800 (PST), J Peyret wrote
> > Well, as usual I am confused by unicode encoding errors.
> > I have a string with problematic characters in it which I'd like to
> > put into a postgresql table.
> > That results in a postgresql error so I am trying to fix things with
> > <string>.encode
> > >>> s = 'he Company\xef\xbf\xbds ticker'
> > >>> print s
> > he Company�s ticker
> > Trying for an encode:
> > >>> print s.encode('utf-8')
> > Traceback (most recent call last):
> > File "<input>", line 1, in <module>
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position
> > 10: ordinal not in range(128)
> > OK, that's pretty much as expected, I know this is not valid utf-8.
> Actually, the string *is* valid UTF-8, but you're confused about encoding and
> decoding. Encoding is the process of turning a Unicode object into a byte
> string. Decoding is the process of turning a byte string into a Unicode object.
...or to put it more simply: encode() is used to convert a unicode
string into a regular string. A unicode string looks like this:
s = u'\u0041'
but your string looks like this:
s = 'he Company\xef\xbf\xbds ticker'
Note that there is no 'u' in front of your string, so it is a regular
string rather than a unicode string. Therefore, encode() is the wrong
method to call on that string.
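Here is a minimal sketch of the decode-then-encode round trip. One assumption on my part: it is written for Python 3, where bytes plays the role of the regular string in this thread and str plays the role of the unicode string. The names raw, text, round_trip and ascii_safe are made up for illustration.

```python
# Python 3 sketch: bytes ~ the "regular string" of this thread,
# str ~ the "unicode string".
raw = b'he Company\xef\xbf\xbds ticker'   # the byte string from the original post

# decode(): byte string -> unicode string. The three bytes EF BF BD are the
# UTF-8 form of U+FFFD, the replacement character.
text = raw.decode('utf-8')

# encode(): unicode string -> byte string. UTF-8 can represent every
# character, so this simply gives the original bytes back.
round_trip = text.encode('utf-8')

# 'xmlcharrefreplace' only matters when the target codec cannot represent a
# character, e.g. ASCII: U+FFFD becomes the character reference &#65533;
ascii_safe = text.encode('ascii', 'xmlcharrefreplace')

print(repr(text))
print(round_trip == raw)
print(ascii_safe)
```

With 'utf-8' as the target, the 'xmlcharrefreplace' handler from the subject line never has anything to replace; it only kicks in for a narrower codec such as 'ascii'.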
> Also, why are the exceptions above complaining about the 'ascii'
> codec if I am asking for 'utf-8' conversion?
If a python function requires a unicode string and a unicode string
isn't provided, then python will implicitly try to convert the string
it was given into a unicode string. In order to convert a given
string into a unicode string, python needs to know the secret code
that was used to produce the given string. The secret code is
otherwise known as a 'codec'. When python attempts an implicit
conversion of a given string into a unicode string, python uses the
default codec, which is normally set to 'ascii'. Since your string
contains non-ascii characters, you get an error. That all happens
long before your 'utf-8' argument ever comes into play.
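The two-step behaviour described above can be imitated explicitly. This is only a sketch, written for Python 3 (an assumption, since the thread is about Python 2), and py2_style_encode is a hypothetical helper, not anything python provides. It spells out the hidden decode that happens before the requested encoding is ever looked at.

```python
def py2_style_encode(raw, encoding, default_codec='ascii'):
    """Hypothetical helper imitating encode() on a regular (byte) string."""
    # Step 1 (hidden in the original call): decode with the default codec.
    text = raw.decode(default_codec)
    # Step 2: only now is the requested encoding used.
    return text.encode(encoding)

raw = b'he Company\xef\xbf\xbds ticker'

try:
    py2_style_encode(raw, 'utf-8')
    error = None
except UnicodeDecodeError as exc:
    # The failure comes from step 1, so it names the 'ascii' codec,
    # not the 'utf-8' that was asked for.
    error = exc

print(type(error).__name__, error.encoding, error.start)
```

A pure-ascii byte string passes through step 1 without trouble, which is why the problem only shows up once non-ascii bytes appear in the data.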
decode() is used to convert a regular string into a unicode string
(the opposite of encode()). Your error is a 'decode' error, a
UnicodeDecodeError (rather than an 'encode' error), because python is
implicitly trying to convert the given regular string into a unicode
string with the default ascii codec, and python is unable to do that.