unicode(s, enc).encode(enc) == s ?

Marc 'BlackJack' Rintsch bj_666 at gmx.net
Fri Dec 28 06:43:13 EST 2007


On Fri, 28 Dec 2007 03:00:59 -0800, mario wrote:

> On Dec 27, 7:37 pm, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
>> Certainly. ISO-2022 is famous for having ambiguous encodings. Try
>> these:
>>
>> unicode("Hallo","iso-2022-jp")
>> unicode("\x1b(BHallo","iso-2022-jp")
>> unicode("\x1b(JHallo","iso-2022-jp")
>> unicode("\x1b(BHal\x1b(Jlo","iso-2022-jp")
>>
>> or likewise
>>
>> unicode("\x1b$@BB","iso-2022-jp")
>> unicode("\x1b$BBB","iso-2022-jp")
>>
>> In iso-2022-jp-3, there are even more ways to encode the same string.
> 
> Wow, it's not easy to see why anyone would ever want that. Is there
> any logic behind this?
> 
> In your samples both of unicode("\x1b(BHallo","iso-2022-jp") and
> unicode("\x1b(JHallo","iso-2022-jp") give u"Hallo" -- does this mean
> that the ignored/lost bytes in the original strings are not illegal
> but *represent nothing* in this encoding?

They are not lost or ignored; they are escape sequences that tell how the
following bytes should be interpreted.  '\x1b(B' switches to ASCII and
'\x1b(J' to a "roman" encoding (JIS X 0201 Roman) that is almost identical
to ASCII, so it doesn't matter which one you choose as long as the
following bytes mean the same thing in both, as the letters in "Hallo" do.
And of course you can repeat such an escape sequence as often as you want
within a run of ASCII byte values.

http://en.wikipedia.org/wiki/ISO-2022-JP#ISO_2022_Character_Sets
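
A small sketch of the point above, using Martin's "Hallo" samples.  It is
written for Python 3, so bytes.decode() stands in for the unicode() call
used in this thread; the names ESC_ASCII, ESC_ROMAN, samples and decoded
are made up for the illustration.

```python
# Python 3 sketch: bytes.decode() replaces the Python 2 unicode() call.
ESC_ASCII = b"\x1b(B"   # escape sequence: following bytes are ASCII
ESC_ROMAN = b"\x1b(J"   # escape sequence: following bytes are JIS X 0201 Roman

# The four byte strings from the quoted message.
samples = [
    b"Hallo",
    ESC_ASCII + b"Hallo",
    ESC_ROMAN + b"Hallo",
    ESC_ASCII + b"Hal" + ESC_ROMAN + b"lo",
]

# The escape sequences change the decoder's state; they produce no
# characters of their own in the decoded text.
decoded = [s.decode("iso-2022-jp") for s in samples]

for raw, text in zip(samples, decoded):
    print(repr(raw), "->", repr(text))
```

All four inputs should come out as the same five-character string, even
though the byte strings differ in length.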

> I.e. in practice (in a context limited to the encoding in question)
> should this be considered as a data loss, or should these strings be
> considered "equivalent"?

Equivalent, I would say.  As Unicode they contain the same characters;
they are just encoded differently as bytes.
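
This is also why the test in the subject line can fail without any data
being lost.  A rough sketch, again in Python 3 syntax; same_text() and
round_trips() are hypothetical helper names invented for this example, not
anything from the standard library.

```python
def same_text(a, b, enc="iso-2022-jp"):
    """True if two byte strings decode to the same Unicode text."""
    return a.decode(enc) == b.decode(enc)


def round_trips(s, enc="iso-2022-jp"):
    """True if decoding and re-encoding gives back the identical bytes."""
    return s.decode(enc).encode(enc) == s


plain = b"Hallo"
escaped = b"\x1b(BHallo"   # same text with a redundant "switch to ASCII"

print(same_text(plain, escaped))   # equivalent as Unicode
print(round_trips(plain))          # the encoder emits exactly these bytes
print(round_trips(escaped))        # the redundant escape is not re-emitted

# The second pair from the quoted message: two different escape
# sequences in front of the same two-byte character.
print(same_text(b"\x1b$@BB", b"\x1b$BBB"))
```

So comparing the decoded text is the meaningful test of equivalence;
comparing the bytes only tells you whether the input happened to be in the
form the encoder itself would produce.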

Ciao,
	Marc 'BlackJack' Rintsch


