Python 2.4 vs 2.5 - Unicode error

Wed Jan 21 19:08:54 EST 2009

On Jan 22, 9:50 am, Gaurav Veda <vedagau... at gmail.com> wrote:
> > The 0xc2 strongly suggests that you are feeding the beast data encoded
> > in UTF-8 while giving it no reason to believe that it is in fact not
> > encoded in ASCII. Curiously the first errant byte is a long way (4KB)
> > into your data. Consider doing
> >     print repr(data)
> > to see what you've actually got there.
> >>> sqlStr[4352:4362]
>
> ' and 25\xc2\xb0F'

That's the UTF-8 version of ' and 25°F' where the character between
the 25 and the F is U+00B0 DEGREE SIGN ... interesting stuff to have
in an SQL query string.

>
> All I want to do is to just replace all the non-ascii characters by a
> space.

I can't imagine why you would want to do that to data, let alone to an
SQL query.

I can't see any evidence that you actually tried to do that, anyway.

To replace non-ASCII characters in a UTF-8-encoded string by spaces:
| >>> u8 = ' and 25\xc2\xb0F'
| >>> u = u8.decode('utf8')
| >>> ''.join([chr(ord(c)) if c <= u'\x7f' else ' ' for c in u])
| ' and 25 F'

>
> > I'm a little skeptical about the "2.4 works, 2.5 doesn't" notion --
> > different versions of mysql, perhaps?
>
> I am trying to put content into the mysql server running on machine A,
> from machine B & machine C with different versions of python. So I
> don't think this is a mysql issue.

Terminology confusion. Consider the possibility of different versions
of MySQLdb (the client interface package) on the client machines B and
C.

Also consider the possibility that you didn't run exactly the same
code on B and C.

> > Show at the very least the full traceback that you get. Try to write a
> > short script that demonstrates the problem with 2.5 and no problem
> > with 2.4, so that (a) it is apparent what you are doing (b) the
> > problem can be reproduced if necessary by someone with access to
> > mysql.

How about a very small script which includes the minimum necessary to
run these two lines (with appropriate substitutions for column_x and
table_y:
sql_str = "select column_x from table_y where column_x = '\xc2\xb0'"
cursor.execute(sql_str)

and run that on B and C

>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "putDataIntoDB.py", line 164, in <module>
>     cursor.execute(sqlStr)
>   File "/usr/lib64/python2.5/site-packages/MySQLdb/cursors.py", line
> 146, in execute
>     query = query.encode(charset)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
> 4359: ordinal not in range(128)
>
> > You might like to explain why you think that doubling backslashes in
> > your SQL is a good idea, and amplify "some processing on the text".
>
> I thought this will achieve 2 things.
> a) It will escape any unicode character (obviously, I was wrong. Got
> carried away by the display. I thought \xc2 will get escaped to \\xc2,
> which is completely preposterous).
> b) It will make sure that the escape sequences in the string (e.g.
> '\n') are received by mysql as an escape sequence.

Run-time programmatic fiddling with an SQL query string is dangerous
and tricky at the best of times, worse when you don't inspect the
result before you press the launch button.

Cheers,
John