unicode(s, enc).encode(enc) == s ?

Wed Jan 2 15:34:22 EST 2008

> Thanks a lot Martin and Marc for the really great explanations! I was
> wondering if it would be reasonable to imagine a utility that will
> determine whether, for a given encoding, two byte strings would be
> equivalent. 

But that is much easier to answer:

  s1.decode(enc) == s2.decode(enc)

Assuming Unicode's unification, for a single encoding, this should
produce correct results in all cases I'm aware of.

If you also have different encodings, you should add

  import unicodedata

  def normal_decode(s, enc):
      return unicodedata.normalize("NFKD", s.decode(enc))

  normal_decode(s1, enc1) == normal_decode(s2, enc2)

This would flatten out compatibility characters, and ambiguities
left in Unicode itself.
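
To make the effect of the normalization step concrete, here is a minimal
sketch. It assumes Python 3 (where byte strings are `bytes`), and the helper
name `same_text` and the sample byte strings are made up for illustration.

```python
import unicodedata


def normal_decode(data, enc):
    """Decode data using enc, then apply NFKD normalization."""
    return unicodedata.normalize("NFKD", data.decode(enc))


def same_text(b1, enc1, b2, enc2):
    """Hypothetical helper: compare two byte strings as normalized text."""
    return normal_decode(b1, enc1) == normal_decode(b2, enc2)


# Precomposed e-acute in Latin-1 versus "e" + combining acute in UTF-8.
latin1 = b"caf\xe9"
utf8 = b"cafe\xcc\x81"

# Plain decoding keeps the two spellings distinct.
print(latin1.decode("latin-1") == utf8.decode("utf-8"))

# After NFKD both are "e" followed by the combining accent.
print(same_text(latin1, "latin-1", utf8, "utf-8"))

# Compatibility characters are flattened too: the "fi" ligature (U+FB01).
print(same_text(b"\xef\xac\x81", "utf-8", b"fi", "ascii"))
```

The first comparison should come out false and the other two true; whether
NFKD is too aggressive for a given application (it also folds ligatures and
similar compatibility forms) is a separate decision.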

> But I think such a utility will require *extensive*
> knowledge about many bizarrities of many encodings -- and has little
> chance of being pretty!

See above.

> In any case, it goes well beyond the situation that triggered my
> original question in the first place, that basically was to provide a
> reasonable check on whether round-tripping a string is successful --
> this is in the context of a small utility to guess an encoding and to
> use it to decode a byte string. This utility module was triggered by
> one that Skip Montanaro had written some time ago, but I wanted to add
> and combine several ideas and techniques (and support for my usage
> scenarios) for guessing a string's encoding in one convenient place.

Notice that this algorithm is not capable of detecting the ISO-2022
encodings - they look like ASCII to this algorithm. This is by design,
as these encodings were designed to use only 7-bit bytes, so that you
can safely transport them in email and such. (*)

If you want to add support for ISO-2022, you should look for escape
characters, and then check whether the escape sequences are among
the ISO-2022 ones:
- ESC (  - 94-character graphic character set, G0
- ESC )  - 94-character graphic character set, G1
- ESC *  - 94-character graphic character set, G2
- ESC +  - 94-character graphic character set, G3
- ESC -  - 96-character graphic character set, G1
- ESC .  - 96-character graphic character set, G2
- ESC /  - 96-character graphic character set, G3
- ESC $  - Multibyte
           ( G0
           ) G1
           * G2
           + G3
- ESC %   - Non-ISO-2022 (e.g. UTF-8)

If you see any of these, it should be ISO-2022; see
the Wiki page as to what subset may be in use.
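
A check along these lines could be sketched as follows. This assumes
Python 3 and covers only the escape sequences listed above; the function
name `looks_like_iso2022` is made up, and it is not part of the module
under discussion.

```python
import re

# ESC followed by one of the designators from the list above.
# "$" (multibyte) may be followed by ( ) * + to select the register;
# some sequences (e.g. in ISO-2022-JP) omit that intermediate byte.
_ISO2022_ESCAPE = re.compile(rb"\x1b(?:\$[()*+]?|[()*+\-./%])")


def looks_like_iso2022(data):
    """Hypothetical helper: True if data contains a listed escape sequence."""
    return _ISO2022_ESCAPE.search(data) is not None


# ISO-2022-JP text consists of 7-bit bytes only, so an ASCII test passes...
sample = "\u65e5\u672c\u8a9e".encode("iso2022_jp")
print(all(byte < 0x80 for byte in sample))

# ...but the escape sequences give it away.
print(looks_like_iso2022(sample))

# Other escape sequences, such as terminal colour codes, are not matched.
print(looks_like_iso2022(b"\x1b[31mred\x1b[0m"))
```

This only says that some ISO-2022 variant is probably in use; deciding which
one still requires looking at the final bytes of the escape sequences.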

G0..G3 indicates which register the character set is loaded
into; once you have loaded a character set into a register,
you can switch between registers through ^N (to G1),
^O (to G0), ESC n (to G2), ESC o (to G3). (*)

> http://gizmojo.org/code/decodeh/
> 
> I will be very interested in any remarks any of you may have!

From a shallow inspection, it looks right. I would have spelled
"losses" as "loses".

Regards,
Martin

(*) For completeness: ISO-2022 also supports 8-bit characters,
and there are more control codes to shift between the various
registers.