Problems Writing £ (pound sterling) To MS SQL Server using pymssql

Mon Nov 17 12:05:43 EST 2008

On Mon, 2008-11-17 at 15:55 +0000, Darren Mansell wrote:
> On Mon, 2008-11-17 at 15:24 +0000, Tim Golden wrote:
> > Darren Mansell wrote:
> > > Hi. 
> > > 
> > > I'm relatively new to python so please be gentle :)
> > > 
> > > I'm trying to write a £ symbol to an MS SQL server using pymssql. This
> > > works but when selecting the data back (e.g. using SQL management
> > > studio) the £ symbol is replaced with Â£ (latin capital letter A with
> > > circumflex).
> > 
> > 
> > This is a bit of a non-answer but... use pyodbc[*],
> > use NVARCHAR cols, and use unicode values on insert:
> > 
> 
> Thanks for the help. Unfortunately pyodbc seems to only work on Windows.
> I need to connect to the SQL server from a Linux box.
> 
> The db schema is very set in stone, I can't do anything with it. I'm
> currently opening autogenerated SQL scripts, decoding them from utf-16
> and then back into utf-8 for pymssql to run them.
> 
> It's been working great for ages until someone noticed the £ symbols had
> this extra character in there..
> 

As I was trying to explain in my other email, the £ does *not* have an
"extra symbol" attached to it.  It is being encoded as UTF-8 and then
decoded as Latin-1 (ISO-8859-1).  If you had other higher-order (>
ASCII) characters in your text, they would also be mis-decoded, but
would probably not show the original character in the output.  That was
just a coincidence.  
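
A minimal sketch of the mis-decoding, written for Python 3 (the original
thread predates it).  The helper name misdecode is made up for
illustration and is not part of pymssql or any other library:

```python
def misdecode(text):
    """Encode text as UTF-8, then (wrongly) decode those bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")

# U+00A3 (pound sign) becomes the two bytes C2 A3 in UTF-8.
pound_bytes = "\u00a3".encode("utf-8")   # b'\xc2\xa3'

# Latin-1 maps every byte to one character, so two bytes become two
# characters: U+00C2 (A with circumflex) followed by U+00A3 (pound sign).
pound_garbled = misdecode("\u00a3")      # '\u00c2\u00a3'

# U+00E9 (e with acute) becomes C3 A9, which Latin-1 reads as
# U+00C3 (A with tilde) followed by U+00A9 (copyright sign).
# The original character is nowhere in the result.
e_acute_garbled = misdecode("\u00e9")    # '\u00c3\u00a9'
```

The pound sign survives only because, for code points U+00A0 through
U+00BF, the second UTF-8 byte has the same value as the code point
itself.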

For example, if you had the character u'\xe6' (æ) in your input, which
has the binary representation 1110 0110, it would be encoded in UTF-8 as
follows:

mask:         110x xxxx  10xx xxxx
byte:                11    10 0110
encoding:     1100 0011  1010 0110
hex:             c    3     a    6
bytestring:   '\xc3\xa6'

If you decode it as UTF-8, you get u'æ', but if you decode it as
latin-1, you get u'Ã¦'.  Note that the latin-1 decoding here does not
include æ.
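
The bit layout above can be checked by hand in Python 3.  This is only a
sketch of the two-byte case (code points U+0080 to U+07FF), and the
function name utf8_two_byte is invented for the example:

```python
def utf8_two_byte(code_point):
    """Pack a code point in the range U+0080..U+07FF into two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte UTF-8 range is handled here")
    lead = 0b11000000 | (code_point >> 6)          # 110x xxxx: upper bits
    cont = 0b10000000 | (code_point & 0b00111111)  # 10xx xxxx: lower six bits
    return bytes([lead, cont])

encoded = utf8_two_byte(0xE6)          # b'\xc3\xa6'
as_utf8 = encoded.decode("utf-8")      # '\u00e6' (the original ae ligature)
as_latin1 = encoded.decode("latin-1")  # '\u00c3\u00a6' (A-tilde, broken bar)
```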

So what you are seeing is best thought of as two garbage characters, one
of which happens (by coincidence only) to be the same as your original
character.  If you decode the bytes returned properly (as UTF-8), you
will get back the characters you put in, for all characters.
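
If the garbled text has already been read into a string, the mis-decode
can usually be reversed.  A rough Python 3 sketch, assuming the text went
through exactly one UTF-8 encode followed by one Latin-1 decode (a client
that used some other code page would need a different codec); the name
undo_misdecode is hypothetical:

```python
def undo_misdecode(garbled):
    """Recover text that was UTF-8 encoded and then decoded as Latin-1."""
    # Re-encoding as Latin-1 restores the original bytes exactly,
    # because Latin-1 maps code points 0..255 one-to-one onto bytes.
    raw = garbled.encode("latin-1")
    # Those bytes were UTF-8 all along, so decode them as such.
    return raw.decode("utf-8")

fixed_pound = undo_misdecode("\u00c2\u00a3")  # '\u00a3' (pound sign)
fixed_ae = undo_misdecode("\u00c3\u00a6")     # '\u00e6' (ae ligature)
```

This raises UnicodeDecodeError or UnicodeEncodeError when the input was
not garbled in this particular way, which is a useful signal rather than
something to suppress.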

Cheers,
Cliff