Is there a way to get utf-8 out of a Unicode string?

Mon Oct 30 02:30:39 EST 2006

thebjorn wrote:

> I've got a database (ms sqlserver) that's (way) out of my control,
> where someone has stored utf-8 encoded Unicode data in regular varchar
> fields, so that e.g. the string 'Blåbærsyltetøy' is in the database
> as 'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y' :-/
> 
> Then I read the data out using adodbapi (which returns all strings as
> Unicode) and I get u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'. I couldn't
> find any way to get back to the original short of:
> 
>   def unfk(s):
>       return eval(repr(s)[1:]).decode('utf-8')
> 
> i.e. chopping off the u in the repr of a unicode string, and relying on
> eval to interpret the \xHH sequences.
> 
> Is there a less hack'ish way to do this?

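first, check if you can get your database adapter to understand that the 
database contains UTF-8 and not ISO-8859-1.  if that's not possible, you 
can roundtrip via ISO-8859-1 yourself.  ISO-8859-1 maps the code points 
0-255 to the byte values 0-255, so encoding gives you back the original 
UTF-8 bytes unchanged, and you can then decode them properly: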

 >>> u = u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'
 >>> u
u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'
 >>> u.encode("iso-8859-1")
'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'
 >>> u.encode("iso-8859-1").decode("utf-8")
u'Bl\xe5b\xe6rsyltet\xf8y'
 >>> print u.encode("iso-8859-1").decode("utf-8")
Blåbærsyltetøy
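
if you need this in more than one place, the roundtrip can be wrapped in a small helper.  the following is only a sketch, written for Python 3 (where every str is already Unicode and the u prefix is optional); the name fix_mojibake and the fall-back behaviour are assumptions made for illustration, not something from adodbapi or the original question.  it returns the input unchanged when the text cannot be encoded as ISO-8859-1 or when the resulting bytes are not valid UTF-8, which is taken here as a sign that the string was not mis-decoded in the first place:

```python
def fix_mojibake(text):
    """Undo UTF-8 data that was mis-decoded as ISO-8859-1.

    Hypothetical helper: the name and the fall-back policy are
    illustrative assumptions, not part of any library.
    """
    try:
        # ISO-8859-1 maps code points 0-255 straight to bytes 0-255,
        # so this recovers the raw UTF-8 bytes stored in the database.
        raw = text.encode("iso-8859-1")
        # Decode those bytes with the encoding they were written in.
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text holds characters above U+00FF, or the bytes
        # are not valid UTF-8: assume it was fine and leave it alone.
        return text


garbled = "Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y"
fixed = fix_mojibake(garbled)
assert fixed == "Bl\xe5b\xe6rsyltet\xf8y"   # i.e. 'Blåbærsyltetøy'
```

note that the fall-back is a heuristic: a correctly decoded string whose ISO-8859-1 bytes happen to form valid UTF-8 would still be converted, so it is safest to apply this only to columns you know are affected.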

</F>