Handling special characters in Python

Steven D'Aprano steve+comp.lang.python at pearwood.info
Tue Jan 1 13:01:51 CET 2013


On Tue, 01 Jan 2013 03:35:56 -0800, anilkumar.dannina wrote:

> I am facing one issue in my module. I am gathering data from sql server
> database. In the data that I got from db contains special characters
> like "endash". Python was taking it as "\x96". I require the same
> character(endash). How can I perform that. Can you please help me in
> resolving this issue.


"endash" is not a character, it is six characters.

On the other hand, "\x96" is a single character when read as Unicode text:

py> c = u"\x96"
py> assert len(c) == 1


But it is not a printable Unicode character. U+0096 is a C1 control code,
and it has no name:

py> import unicodedata
py> unicodedata.name(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
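
For anyone wanting to check what kind of code point U+0096 is, here is a
minimal sketch using the standard unicodedata module. It is written for
Python 3 (the session above is Python 2), and the variable names are only
illustrative:

```python
import unicodedata

c = u"\x96"  # the code point U+0096 as text, not the byte 0x96

# General category "Cc" means "control character".
category = unicodedata.category(c)

# Passing a default as the second argument avoids the ValueError
# that is raised for code points without a name.
name = unicodedata.name(c, None)

print(category, name)
```

A control code with no name is very unlikely to be what the database meant
to store, which is why the next step is to treat the value as a raw byte.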


So if it is not a meaningful Unicode character, it is probably a byte:

py> c = "\x96"
py> print c
�


To convert byte 0x96 to an en dash character, you need to identify the
encoding the data was stored in.

(Aside: once you have identified that encoding, *stop* using it. It is 2013
now, and anyone who is not using UTF-8 is doing it wrong. Legacy encodings
are still necessary for legacy data, but any new data should always use
UTF-8.)
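
As a rough illustration of what storing new data as UTF-8 looks like for
this particular character, here is a small sketch. It assumes Python 3
syntax, and the variable names are made up for the example:

```python
en_dash = u"\u2013"  # EN DASH as Unicode text

# Encoding to UTF-8 gives a three-byte sequence for this character.
utf8_bytes = en_dash.encode("utf-8")

# Decoding with the same encoding recovers the original text.
round_trip = utf8_bytes.decode("utf-8")

print(repr(utf8_bytes), round_trip == en_dash)
```

Note that the UTF-8 form of the en dash is not the single byte 0x96, so the
data in the database is evidently in some legacy encoding.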

CP 1252 is one possible encoding, but there may be others:

py> uc = c.decode('cp1252')
py> unicodedata.name(uc)
'EN DASH'
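
If you are not sure which encoding the database uses, one rough way to
narrow it down is to try a shortlist and see which candidates turn 0x96
into an en dash. The sketch below assumes Python 3, and the list of
candidates is only a hypothetical starting point, not a statement about
what SQL Server is configured to use:

```python
import unicodedata

raw = b"\x96"  # the byte as it arrived from the database

# Hypothetical shortlist of encodings to try; adjust for your data source.
candidates = ["cp1252", "cp1250", "latin-1", "mac_roman", "cp437", "utf-8"]

matches = []
for encoding in candidates:
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        continue  # this byte is not valid in that encoding
    # Only single characters can be looked up by name; "" is the
    # default returned for characters that have no name.
    if len(text) == 1 and unicodedata.name(text, "") == "EN DASH":
        matches.append(encoding)

print(matches)
```

More than one encoding may match, so a single byte is weak evidence on its
own. The database or connection settings are a more reliable source for the
real answer.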



-- 
Steven


