Strange problems with encoding
Michael Hudson
mwh at python.net
Thu Nov 6 08:56:43 EST 2003
Rudy Schockaert <rudy.schockaert at pandoraSTOPSPAM.be> writes:
> Sebastian Meyer wrote:
>
> > Hi newsgroup,
> > I am trying to replace German special characters in strings like
> > str = re.sub('ö', 'oe', str)
> > When I work with this, I always get the message
> > UniCode Error: ASCII decoding error : ordinal not in range(128)
> > Yes, I have googled; I searched the FAQ, the manual and the Python
> > library, and all the other sources of information I know.
> > I played with the built-in encode function to enforce the right
> > encoding, but the error stays the same. I've read a lot about Unicode
> > and the internal string conversions Python does, but somehow I've
> > missed the clue.
> > Nope, Python says Huuups... ordinal not in range(128) ;-(
> > Does anyone have any idea? Seems like I am too stupid to read the
> > documentation carefully; perhaps I misunderstand something...
> > thanks for your help in advance
> > Sebastian
>
> I'm experiencing something similar at the moment. I try to
> base64-encode Unicode strings and I get the exact same error message.
"base64-encoding Unicode strings" is not a particularly well defined
operation. "base64-encoding" is a way of turning *binary data* into a
particularly "safe" sequence of ascii characters.
Unicode (in some sense) is a family of ways of representing strings of
characters as binary data.
So to base64-encode a Unicode string, you need to choose *which*
member of this family you're going to use, which is to say the
encoding. UTF-8 would seem a good bet.
But...
> >>> s = u'ö'
> >>> s
> u'\xf6'
> >>> s.encode('base64')
> Traceback (most recent call last):
>   File "<interactive input>", line 1, in ?
>   File "C:\Python23\lib\encodings\base64_codec.py", line 24, in base64_encode
>     output = base64.encodestring(input)
>   File "C:\Python23\lib\base64.py", line 39, in encodestring
>     pieces.append(binascii.b2a_base64(chunk))
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 0: ordinal not in range(128)
>>> u'ö'.encode('utf-8').encode('base64')
'w7Y=\n'
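
For readers on Python 3, where str.encode('base64') is no longer
available, the same two-step idea (pick a text encoding, then
base64-encode the resulting bytes) might look roughly like the sketch
below. The helper names b64_encode_text and b64_decode_text are made up
for illustration, and UTF-8 as the default is an assumption following
the suggestion above.

```python
import base64


def b64_encode_text(text, encoding="utf-8"):
    """Encode text to bytes with the chosen encoding, then base64-encode the bytes."""
    raw = text.encode(encoding)                    # text -> binary data
    return base64.b64encode(raw).decode("ascii")   # binary data -> ASCII-safe text


def b64_decode_text(data, encoding="utf-8"):
    """Reverse b64_encode_text: base64-decode to bytes, then decode to text."""
    return base64.b64decode(data).decode(encoding)


encoded = b64_encode_text("\u00f6")   # "\u00f6" is the letter o-umlaut
print(encoded)                        # w7Y=
assert b64_decode_text(encoded) == "\u00f6"
```

Unlike the 'base64' codec used in the session above, base64.b64encode
does not append a trailing newline.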
> When I don't specify it's unicode it works:
> >>> s = 'ö'
> >>> s
> '\xf6'
> >>> s.encode('base64')
> '9g==\n'
Well, this works because your terminal seems to be latin-1, so the
plain string 'ö' is already the single byte '\xf6':
>>> u'ö'.encode('latin-1').encode('base64')
'9g==\n'
What would you like to do with a character that isn't in latin-1?
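
To make that question concrete, here is a small Python 3 sketch (not
the Python 2.3 used in this thread). It shows that the base64 result
depends on which encoding is chosen, and that a character outside
latin-1 cannot be encoded that way at all. The euro sign is just an
example picked for illustration.

```python
import base64

text = "\u00f6"  # the letter o-umlaut, U+00F6

# Same character, two encodings: different bytes, so different base64 output.
latin1_b64 = base64.b64encode(text.encode("latin-1")).decode("ascii")
utf8_b64 = base64.b64encode(text.encode("utf-8")).decode("ascii")
print(latin1_b64)  # 9g==
print(utf8_b64)    # w7Y=

# The euro sign (U+20AC) has no latin-1 representation, so encoding fails.
try:
    "\u20ac".encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)   # False

# UTF-8 can represent it, so the two-step approach still works.
euro_b64 = base64.b64encode("\u20ac".encode("utf-8")).decode("ascii")
print(euro_b64)    # 4oKs
```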
> The reason I want to base64-encode these unicode strings is because I
> get those as input and want to store them in a MySQL database using
> SQLObject.
Why can't you just encode them as utf-8 strings? (Or, thinking
about it, why doesn't SQLObject support unicode?)
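
As a rough illustration of storing UTF-8 bytes directly, here is a
Python 3 sketch. It uses the standard library's sqlite3 module purely
as a stand-in, since it needs no server; it says nothing about how
MySQL or SQLObject behave. The table and column names are invented for
the example.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the real MySQL setup.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body BLOB)")

original = "Gr\u00fc\u00dfe aus K\u00f6ln"  # German text with umlauts and sharp s

# Encode explicitly on the way in ...
conn.execute("INSERT INTO notes (body) VALUES (?)", (original.encode("utf-8"),))

# ... and decode explicitly on the way out.
(stored,) = conn.execute("SELECT body FROM notes").fetchone()
restored = stored.decode("utf-8")
print(restored == original)  # True

conn.close()
```

No base64 step is needed here because the column holds binary data;
base64 only becomes necessary when the storage or transport can carry
nothing but ASCII.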
Cheers,
mwh
--
I think if we have the choice, I'd rather we didn't explicitly put
flaws in the reST syntax for the sole purpose of not insulting the
almighty. -- /will on the doc-sig