[Tutor] Decode and Encode

Sunil Tech sunil.techspk at gmail.com
Wed Jan 28 13:52:13 CET 2015


Thank you for all your replies.


On Wed, Jan 28, 2015 at 4:56 PM, Steven D'Aprano <steve at pearwood.info>
wrote:

> On Wed, Jan 28, 2015 at 03:05:58PM +0530, Sunil Tech wrote:
> > Hi All,
> >
> > When I copied some text from the web and pasted it into the Python
> > terminal, it was automatically converted into Unicode (I suppose).
> >
> > Can anyone tell me how it does that?
> > Eg:
> > >>> p = "你好"
> > >>> p
> > '\xe4\xbd\xa0\xe5\xa5\xbd'
>
> It is hard to tell exactly, since we cannot see what p is supposed to
> be. I am guessing that you are using Python 2.7, which uses
> byte-strings by default, not Unicode text-strings.
>
> To really answer your question correctly, we need to know the operating
> system and which terminal you are using, and the terminal's encoding. I
> will guess a Linux system, with UTF-8 encoding in the terminal.
>
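One way to check the encodings mentioned above is to ask Python directly. This is a minimal sketch written for Python 3; the helper name report_encodings is made up for illustration, and the values it reports depend entirely on your system and terminal.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python associates with the terminal and locale.

    Hypothetical helper for illustration; the values vary by system.
    """
    return {
        # Encoding used to decode what you type or paste (None if unavailable).
        'stdin': getattr(sys.stdin, 'encoding', None),
        # Encoding used when printing text to the terminal.
        'stdout': getattr(sys.stdout, 'encoding', None),
        # Encoding the locale settings suggest for text data.
        'locale': locale.getpreferredencoding(False),
    }


for name, encoding in report_encodings().items():
    print(name, encoding)
```

If the terminal is set up for UTF-8, all three entries will typically name UTF-8, which would be consistent with the bytes shown in the original question.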
> So, when you paste some Unicode text into the terminal, the terminal
> receives the UTF-8 bytes, and displays the characters:
>
> 你好
>
> On my system, they display as boxes, but I expect that they are:
>
> CJK UNIFIED IDEOGRAPH-4F60
> CJK UNIFIED IDEOGRAPH-597D
>
> But, because this is Python 2, and you used byte-strings "" instead of
> Unicode strings u"", Python sees the raw UTF-8 bytes.
>
> py> s = u'你好'  # Note this is a Unicode string u'...'
> py> import unicodedata
> py> for c in s:
> ...     print unicodedata.name(c)
> ...
> CJK UNIFIED IDEOGRAPH-4F60
> CJK UNIFIED IDEOGRAPH-597D
> py> s.encode('UTF-8')
> '\xe4\xbd\xa0\xe5\xa5\xbd'
>
> which matches your results.
>
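The reverse direction may also help: starting from the six bytes in the question and decoding them back to text. This is a sketch for Python 3, and it assumes the bytes really are UTF-8, as guessed above.

```python
import unicodedata

# The six bytes that the Python 2 session displayed for the pasted text.
raw = b'\xe4\xbd\xa0\xe5\xa5\xbd'

# Decoding turns bytes into text. UTF-8 is an assumption about the terminal.
text = raw.decode('utf-8')

# Each character has an official Unicode name.
names = [unicodedata.name(c) for c in text]

print(len(raw), 'bytes decode to', len(text), 'characters')
for name in names:
    print(name)

# Encoding the text again gives back the original bytes.
print(text.encode('utf-8') == raw)
```

Each of the two characters takes three bytes in UTF-8, which is why the byte string is six bytes long.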
> Likewise for this example:
>
> py> s = u'ªîV'  # make sure to use Unicode u'...'
> py> for c in s:
> ...     print unicodedata.name(c)
> ...
> FEMININE ORDINAL INDICATOR
> LATIN SMALL LETTER I WITH CIRCUMFLEX
> LATIN CAPITAL LETTER V
> py> s.encode('utf8')
> '\xc2\xaa\xc3\xaeV'
>
>
> which matches yours:
>
> > >>> o = 'ªîV'
> > >>> o
> > '\xc2\xaa\xc3\xaeV'
>
>
> Obviously all this is confusing and harmful. In Python 3, the interpreter
> defaults to Unicode text strings, so this issue goes away.
>
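For comparison, here is a rough sketch of the same two examples in Python 3, where a plain '...' literal is already Unicode text and bytes only appear when you encode explicitly. The variable names are just for illustration.

```python
# In Python 3, plain string literals are Unicode text (type str).
s = '你好'
o = 'ªîV'

print(type(s).__name__, len(s))  # characters, not bytes
print(type(o).__name__, len(o))

# Bytes only appear when you ask for an explicit encoding.
encoded_s = s.encode('utf-8')
encoded_o = o.encode('utf-8')

print(encoded_s)
print(encoded_o)
```

The encoded values are the same byte sequences that the Python 2 session showed for the plain '...' strings.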
>
> --
> Steve

