[Tutor] Trouble in dealing with special characters.

Steven D'Aprano steve at pearwood.info
Fri Dec 7 04:50:49 EST 2018


On Fri, Dec 07, 2018 at 01:28:18PM +0530, Sunil Tech wrote:
> Hi Tutor,
> 
> I have a trouble with dealing with special characters in Python 

There are no special characters in Python. There are only Unicode 
characters. All characters are Unicode, including those which are also 
ASCII.

Start here:

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

https://blog.codinghorror.com/there-aint-no-such-thing-as-plain-text/

https://www.youtube.com/watch?v=sgHbC6udIqc

https://nedbatchelder.com/text/unipain.html


https://docs.python.org/3/howto/unicode.html

https://docs.python.org/2/howto/unicode.html


Its less than a month away from 2019. It is sad and shameful to be 
forced to use only ASCII, and nearly always unnecessary. Writing code 
that only supports the 128 ASCII characters is like writing a calculator 
that only supports numbers from 1 to 10.

But if you really must do so, keep reading.


> Below is
> the sentence with a special character(apostrophe) "MOUNTAIN VIEW WOMEN’S
> HEALTH CLINIC" with actually should be "MOUNTAIN VIEW WOMEN'S HEALTH CLINIC
> ".

Actually, no, it should be precisely what it is: "WOMEN’S" is correct, 
since that is an apostrophe. Changing the ’ to an inch-mark ' is not 
correct. But if you absolutely MUST change it:

mystring = "MOUNTAIN VIEW WOMEN’S HEALTH CLINIC"
mystring = mystring.replace("’", "'")

will do it in Python 3. In Python 2 you have to write this instead:

# Python 2 only
mystring = u"MOUNTAIN VIEW WOMEN’S HEALTH CLINIC"
mystring = mystring.replace(u"’", u"'")


to ensure Python uses Unicode strings.

What version of Python are you using, and what are you doing that gives 
you trouble?

It is very unlikely that the only way to solve the problem is to throw 
away the precise meaning of the text you are dealing with by reducing it 
to ASCII.

In Python 3, you can also do this:

mystring = ascii(mystring)

but the result will probably not be what you want.


> Please help, how to identify these kinds of special characters and replace
> them with appropriate ASCII?

For 99.99% of the characters, there is NO appropriate ASCII. What 
ASCII character do you expect for these?

    § π Й খ ₪ ∀ ▶ 丕 ☃ ☺️

ASCII, even when it was invented in 1963, wasn't sufficient even for 
American English (no cent sign, no proper quotes, missing punctuation 
marks) let alone British English or international text.

Unless you are stuck communicating with an ancient program written in 
the 1970s or 80s that cannot be upgraded, there are few good reasons to 
cripple your program by only supporting ASCII text.

But if you really need to, this might help:

http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/

http://code.activestate.com/recipes/578243-repair-common-unicode-mistakes-after-theyve-been-m/




-- 
Steve


More information about the Tutor mailing list