I like Unicode more than I used to...

Mon Feb 24 18:59:43 EST 2003

On Sun, 23 Feb 2003 18:57:38 -0600, Skip Montanaro <skip at pobox.com>
wrote:

>    >> >>> f.write("abc\r\n")
>    >> >>> f.write(u"\N{TRADE MARK SIGN}\r\n")
>    >> >>> f.write(u"\u8482\r\n")
>
>    Steven> Incidentally, the trade mark sign was referred to as character
>    Steven> 8482 in that other thread because that's its decimal value:
>
>Yeah, I figured that out, but thought the glyph that was rendered was kinda
>cool...
>
>Skip
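
The decimal/hex mix-up in the quoted exchange is easy to check. This is a minimal sketch, assuming a modern Python 3 interpreter rather than the 2.3 alpha used in this thread, and it uses only the standard `unicodedata` module. The variable names are mine, for illustration only.

```python
import unicodedata

tm = u"\N{TRADE MARK SIGN}"

# The trade mark sign is U+2122; 8482 is the same number written in decimal.
print(ord(tm), hex(ord(tm)))

# A \uXXXX escape takes hexadecimal digits, so u"\u8482" is a different
# character altogether (it falls in the CJK unified ideographs block).
other = u"\u8482"
print(unicodedata.name(tm))
print(unicodedata.name(other))
print(tm == other)
```

So `\u2122` (or the `\N{...}` form) is the spelling that yields the trade mark sign, while `\u8482` yields the ideograph.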

With the new _iconv_codec module in Python 2.3, things become even
more interesting.

If you have linked the module against an iconv library that contains
tables for 2-byte encodings, you can do magic without the need for
actual, dedicated codecs. In this example I'm using GNU iconv 1.8
(which, to my knowledge, is the most complete implementation):

Python 2.3a2+ (#6, Feb 24 2003, 17:14:06) # with the cygwin 'fix'
[GCC 3.2 20020927 (prerelease)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> s = u'\u306e' # Hiragana for the phoneme `no'
>>> s             # string becomes a Unicode object...
u'\u306e'
>>> s.encode('EUC-JP')      # Preferred in Unixland
'\xa4\xce'
>>> s.encode('SHIFT_JIS')   
'\x82\xcc'
>>> s.encode('CP932')       # Preferred in Redmond, WA.
'\x82\xcc'
>>> s.encode('ISO-2022-JP') # Preferred by the imperial apparatchik
'\x1b$B$N'
>>> s.encode('ISO-2022-JP-2')
'\x1b$B$N'
>>> s.encode('ISO-2022-JP-1')
'\x1b$B$N'
>>> s.encode('UCS-2')   # Wha'? Hmm... This is unexpected.
'0n'
>>> s.encode('UCS-4')
'\x00\x000n'
>>> s.encode('UTF-16')
'\xff\xfen0'
>>> s.encode('UTF-32')
'\x00\x00\xfe\xff\x00\x000n'
>>> s.encode('UTF-7')
'+MG4-'
>>> s.encode('UCS-2BE')   # Wha'?
'0n'
>>> s.encode('UCS-2LE')   # Wha'? And the Redmond OS uses this internally?
'n0'
>>> s.encode('EUC-JISX0213')
'\xa4\xce'

... ad nauseam
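
On reflection, the "Wha'?" results are less mysterious than they look: the code point is U+306E, and its two bytes, 0x30 and 0x6E, happen to be ASCII '0' and 'n', so the repr shows them as printable characters. Here is a minimal sketch of that, assuming a modern Python 3 interpreter with its built-in `utf-16-be` and `utf-16-le` codecs standing in for the iconv-backed UCS-2BE and UCS-2LE names used in the session (for a character in the Basic Multilingual Plane, the two encodings produce the same bytes).

```python
import binascii

s = u'\u306e'  # HIRAGANA LETTER NO, code point U+306E

# Built-in codecs used here as stand-ins for iconv's UCS-2BE / UCS-2LE.
be = s.encode('utf-16-be')   # big-endian, no byte order mark
le = s.encode('utf-16-le')   # little-endian, no byte order mark

# Hex dumps make the byte order visible.
print(binascii.hexlify(be))
print(binascii.hexlify(le))

# 0x30 is ASCII '0' and 0x6E is ASCII 'n', so these compare equal.
print(be == b'0n')
print(le == b'n0')
```

The same reading applies to the UCS-4 result: two zero bytes of padding followed by the same 0x30 0x6E pair.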

-- 
Alejandro Lopez-Valencia                          tora no shinden
python -c "print('ZHJhZHVsQDAwN211bmRvLmNvbQ=='.decode('base64'))"