Is there a way to get utf-8 out of a Unicode string?
Fredrik Lundh
fredrik at pythonware.com
Mon Oct 30 02:30:39 EST 2006
thebjorn wrote:
> I've got a database (ms sqlserver) that's (way) out of my control,
> where someone has stored utf-8 encoded Unicode data in regular varchar
> fields, so that e.g. the string 'Blåbærsyltetøy' is in the database
> as 'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y' :-/
>
> Then I read the data out using adodbapi (which returns all strings as
> Unicode) and I get u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'. I couldn't
> find any way to get back to the original short of:
>
> def unfk(s):
>     return eval(repr(s)[1:]).decode('utf-8')
>
> i.e. chopping off the u in the repr of a unicode string, and relying on
> eval to interpret the \xHH sequences.
>
> Is there a less hack'ish way to do this?
first, check if you can get your database adapter to understand that the
database contains UTF-8 and not ISO-8859-1. if that's not possible, you
can roundtrip via ISO-8859-1 yourself:
>>> u = u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'
>>> u
u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'
>>> u.encode("iso-8859-1")
'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'
>>> u.encode("iso-8859-1").decode("utf-8")
u'Bl\xe5b\xe6rsyltet\xf8y'
>>> print u.encode("iso-8859-1").decode("utf-8")
Blåbærsyltetøy
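
if you need to do this for many values, the roundtrip can be wrapped in a small helper. the following is only a sketch, not part of adodbapi or any library: the name fix_mojibake is made up for illustration, and it is written for Python 3, where every str is Unicode (the same two calls work on unicode objects in Python 2). it assumes that any string which doesn't survive the roundtrip was decoded correctly in the first place and should be left alone, which may not hold for your data.

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were mistakenly decoded as ISO-8859-1.

    Hypothetical helper for illustration; returns the input unchanged
    if it does not look like mis-decoded UTF-8.
    """
    try:
        # ISO-8859-1 maps code points 0-255 one-to-one onto byte values
        # 0-255, so encoding recovers the original bytes from the database.
        raw = text.encode("iso-8859-1")
        # Those bytes were UTF-8 all along, so decode them as such.
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the string contains code points above 255, or the bytes
        # are not valid UTF-8. Assumption: the string was already correct.
        return text


# The example string from this thread, as returned by the database adapter.
fixed = fix_mojibake("Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y")
```

note that the fallback is a guess: a short string that was stored correctly but happens to form valid UTF-8 when encoded as ISO-8859-1 would be changed by this helper, so fixing the adapter configuration remains the better option when it is available.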
</F>