Determining Unicode encoding.

Sean sean at activeprime.com
Tue Apr 29 16:42:40 EDT 2003


I'm really new to dealing with Unicode, so please bear with me.  I'm
trying to add Unicode support to a program I'm working on, and I'm
getting stuck when writing a Unicode string to a file.  I know I have
to encode the string using some encoding (UTF-8, UTF-16, Latin-1,
etc.).  The problem is that I don't know how to determine which
encoding is the *right* one for a particular string.  The way I
understand it, UTF-8 will handle any Unicode data, but it will
translate characters outside the standard ASCII set to fit within the
8-bit character table.  My problem is that I'm handling data from a
lot of different encodings (Latin, Eastern, Asian, etc.) and I can't
allow the data in the strings to be changed.  I also can't (at least I
don't know how to) determine which encodings the strings are using;
that is, I don't know which strings come from which languages.  Is
there any way to determine, from the Unicode string itself, which
encoding I need to use to prevent data loss?  Or do I need to find a
way to determine beforehand which encoding the strings are using when
they are read in?
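
To make "data loss" concrete, here is a minimal sketch of the check I
have in mind.  It assumes Python 3 string semantics, and the names
round_trips and samples are made up for illustration.  It encodes a
string, decodes the bytes again, and compares the result with the
original:

```python
def round_trips(text, encoding):
    """Return True if text survives encode -> decode unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        # The encoding cannot represent some character in text.
        return False


# Hypothetical sample strings, written with escapes to keep this ASCII.
samples = [
    "caf\u00e9",                                        # Latin, accented e
    "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac",  # Greek
    "\u65e5\u672c\u8a9e",                                # Japanese
]

for s in samples:
    for enc in ("utf-8", "utf-16", "latin-1"):
        print(ascii(s), enc, round_trips(s, enc))
```

With a check like this, UTF-8 and UTF-16 should report True for every
sample, while Latin-1 should only pass strings made up entirely of its
own 256 characters.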

Am I even asking the right questions?  I'm really pretty lost and my
O'Reilly books aren't helping very much.

-Sean



