Determining Unicode encoding.
"Martin v. Löwis"
martin at v.loewis.de
Tue Apr 29 17:02:13 EDT 2003
Sean wrote:
> The problem is that I don't know how to determine what
> the *right* encoding to use on a particular string is.
Do you have this problem when reading a byte string, or when
writing it?
If you are given a byte string, and you are supposed to interpret
the bytes as characters, there is, in general, no good way to do
so - that's why people came up with the idea of a universal character
set in the first place, to overcome the problems with multiple
character sets.
That said, you can make educated guesses on the data you read.
1. Perhaps the data you read has some file format which specifies
the encoding, or allows parametrization, such as XML or HTML.
You will need to look *into* the file to find out what its
encoding is.
2. Perhaps the data has some fixed encoding, as part of the file
format specification. For many files, this is US-ASCII.
3. Perhaps this is a plain text file, and you should use the encoding
that the user's text editor is most likely to use (of course, you
don't know what text editor the user uses, nor what encoding that
editor uses). locale.getdefaultlocale()[1] offers you some guess;
Python 2.3's locale.getpreferredencoding() gives a better guess.
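In today's Python 3 notation, guesses 1 and 3 might be sketched like this (guess_encoding is a made-up helper, and the XML-declaration regex is deliberately simplistic):

```python
import locale
import re

def guess_encoding(data):
    """Guess the encoding of a byte string; illustrative only."""
    # Guess 1: the file format itself may name its encoding, e.g. an
    # XML declaration like <?xml version="1.0" encoding="iso-8859-1"?>
    m = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', data)
    if m:
        return m.group(1).decode('ascii')
    # Guess 3: fall back to what the user's environment suggests.
    # (locale.getdefaultlocale()[1] is the older, rougher variant.)
    return locale.getpreferredencoding()

print(guess_encoding(b'<?xml version="1.0" encoding="iso-8859-1"?><doc/>'))
# prints "iso-8859-1"
```

A real XML parser does this declaration sniffing for you, of course; the point is only that the encoding has to come from *somewhere* outside the raw bytes.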
> Is there any way to
> determine, from the unicode string itself, what encoding I need to use
> to prevent data loss?
That sounds like you have the problem when *writing* Unicode strings.
In that case, you can invoke .encode: it will give a UnicodeError if
the encoding is not supported. At some point, you need to make up your
mind what encoding to use for a certain file - if you then get an error,
all you can do is inform the user, and either
a) perhaps ignore the bad characters, replacing them with appropriate
replacement characters (usually '?'), or
b) go back and recode the output so far in a different encoding.
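In modern Python terms, option a) corresponds to encoding with errors='replace' (the helper name below is made up):

```python
def encode_or_degrade(text, encoding):
    """Hypothetical helper: encode text, falling back per option a)."""
    try:
        return text.encode(encoding)  # raises UnicodeError on failure
    except UnicodeError:
        # Option a): substitute a replacement character (usually '?')
        # for anything the chosen encoding cannot represent.
        return text.encode(encoding, errors='replace')
        # Option b) would instead re-encode all output so far in a more
        # capable encoding, e.g. text.encode('utf-8').

print(encode_or_degrade(u'caf\u00e9', 'ascii'))  # prints b'caf?'
```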
> Am I even asking the right questions? I'm really pretty lost and my
> O'Reilly books arn't helping very much.
Don't worry. These things are inherently difficult. Organizations like
W3C have essentially given up, and say that XML is UTF-8 by default
(knowing that this will support arbitrary characters). If people
absolutely want XML in different encodings, they can do that, but they
are left alone with the issue of encoding unsupported characters
(for XML, they can actually use character references).
You will have to make explicit choices: either support only UTF-8
(and accept that it will be tedious for some users to produce the proper
files), or support arbitrary encodings (and accept that some encodings
cannot represent all characters, and that you may not have the codecs
available to read the data, and that a mechanism must be provided to
determine the encoding), or support only a few non-UTF-8 encodings
(restricting the data format to a subset of all living languages).
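For the "you may not have the codecs available" case, you can at least test availability up front: codecs.lookup() raises LookupError for an unknown encoding (codec_available is an illustrative helper):

```python
import codecs

def codec_available(name):
    """Return True if a codec for the given encoding name is installed."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

print(codec_available('utf-8'))          # prints True
print(codec_available('no-such-codec'))  # prints False
```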
Regards,
Martin