[Tutor] Not-quite-unicode string, how to convert to ascii?

Jeff Kowalczyk jtk at yahoo.com
Mon Apr 19 13:34:14 EDT 2004


I have string input in some strange encoding. Some editors (Win32 TextPad)
pick it up as Unicode, but Linux gedit doesn't recognize the encoding and
won't load it as UTF-8. Python's encode('ascii') doesn't even alter the
string.

I can see that it is double-byte, but I have no idea which encoding it is.

Source reads '  Batch   '
>>> f = open('input.txt','r')
>>> s = f.read()
>>> s[:20]
' \x00 \x00B\x00a\x00t\x00c\x00h\x00 \x00 \x00 \x00'

I was almost tempted to just iterate over the raw string, remove every
'\x00', and leave it at that. The input files are about 180 KB in size.
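
A rough sketch of the two options, assuming the data is UTF-16
little-endian with no byte order mark. That is only a guess from the byte
pattern shown above (each ASCII character followed by '\x00'), not something
confirmed for the whole file. The function names strip_nulls and
decode_utf16le are made up for illustration, and the code is written for
Python 3 bytes/str.

```python
# Sample bytes copied from the s[:20] output above.
# Assumption: the file is UTF-16-LE without a BOM.
raw = b' \x00 \x00B\x00a\x00t\x00c\x00h\x00 \x00 \x00 \x00'


def strip_nulls(data):
    """Crude approach: drop every NUL byte and treat the rest as ASCII.

    Raises UnicodeDecodeError if any remaining byte is outside ASCII.
    """
    return data.replace(b'\x00', b'').decode('ascii')


def decode_utf16le(data):
    """Decode as UTF-16-LE, then reduce to ASCII.

    Characters that have no ASCII equivalent become '?'.
    """
    text = data.decode('utf-16-le')
    return text.encode('ascii', 'replace').decode('ascii')


print(repr(strip_nulls(raw)))
print(repr(decode_utf16le(raw)))
```

Both give the same result on this sample, but stripping nulls only holds up
if every character in the file is plain ASCII. Decoding first handles
anything else in a defined way.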

Can anyone suggest a way to convert the DBCS input to plain ASCII? Thanks.



