Determining Unicode encoding.

Skip Montanaro skip at pobox.com
Tue Apr 29 17:30:21 EDT 2003


    Sean> Is there any way to determine, from the unicode string itself,
    Sean> what encoding I need to use to prevent data loss?

If by "unicode string" you mean one of Python's unicode objects, there's no
need.  It's stored internally in a way which doesn't require you tack a
specific encoding onto it for internal use.  When you write it out, you have
to figure out what encoding is appropriate for that particular output device
(a file which will be consumed by another program, a terminal which supports
a limited number of encodings, etc).  In general, if you are the only
consumer of the file you're producing, I'd simply encode the object as
utf-8.
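
For example, writing one out might look like this (a throwaway sketch; the
file name is made up):

    u = u"r\xe9sum\xe9"            # a unicode object - no encoding attached
    f = open("/tmp/out.txt", "wb")
    f.write(u.encode("utf-8"))     # choose the encoding at output time
    f.close()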

If by "unicode string" you mean a series of bytes input to your program from
some unknown source in some unknown encoding, you have your work cut out for
you.  Depending on what encodings are used in the strings input to your
program and how you get them, you may or may not reliably know the encoding.
For example, if you're sucking pages off a web server, it will probably tell
you what the encoding is (presuming whatever tool was used to generate the
page encoded things properly).  If you're just being fed random files with
no encoding information, you have to apply some heuristics to the problem.
The various encodings related to iso-8859-* (including the Microsoft cp125*
code pages) overlap so much that it can be a challenge to get things
correct.  On the other hand, as I understand it, the common Japanese
encodings tend not to overlap much, if at all, so you can pretty much just
try decoding using the various possibilities in any order and quit with the
first one which succeeds.
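
(For the web server case, something like
urllib.urlopen(url).info().getparam("charset") should get you the declared
charset when the server sends one.)  The try-in-order approach might look
like this (untested sketch; the codec names assume you have the Japanese
codecs installed):

    def try_decodings(s, encodings=("shift_jis", "euc-jp", "iso-2022-jp")):
        # return the first successful decode along with the encoding used;
        # LookupError covers codecs which aren't installed
        for enc in encodings:
            try:
                return unicode(s, enc), enc
            except (UnicodeError, LookupError):
                pass
        return None, None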

I use this function in my code:

    import re
    import sys

    def decode_heuristically(s, enc=None, denc=sys.getdefaultencoding()):
        """Try interpreting s using several possible encodings.
        The return value is a three-element tuple.  The first element is
        either an ASCII string or a Unicode object.  The second element is
        1 if the decoder had to punt and delete some characters from the
        input to successfully generate a Unicode object.  The third element
        is the name of the encoding used."""
        if isinstance(s, unicode):
            return s, 0, "utf-8"
        try:
            # if it decodes as ascii, we're done
            unicode(s, "ascii")
            return s, 0, "ascii"
        except UnicodeError:
            encodings = ["utf-8", "iso-8859-1", "cp1252", "iso-8859-15"]
            # if the default encoding is not ascii it's a good thing to try
            if denc != "ascii":
                encodings.insert(0, denc)
            # always try any caller-provided encoding first
            if enc:
                encodings.insert(0, enc)
            for enc in encodings:

                # Most of the characters between 0x80 and 0x9F are
                # displayable in cp1252 but are control characters in
                # iso-8859-1.  Skip iso-8859-1 if they are found, even
                # though the unicode() call might well succeed.

                if (enc in ("iso-8859-15", "iso-8859-1") and
                    re.search(r"[\x80-\x9f]", s) is not None):
                    continue

                # Characters in the given range are more likely to be
                # symbols used in iso-8859-15, so even though unicode()
                # may accept such strings with those encodings, skip them.

                if (enc in ("iso-8859-1", "cp1252") and
                    re.search(r"[\xa4\xa6\xa8\xb4\xb8\xbc-\xbe]", s)
                        is not None):
                    continue

                try:
                    x = unicode(s, enc)
                except UnicodeError:
                    pass
                else:
                    # accept this encoding only if it round-trips exactly
                    if x.encode(enc) == s:
                        return x, 0, enc

            # nothing worked perfectly - try again, but use the "ignore"
            # parameter and return the longest result
            output = [(unicode(s, enc, "ignore"), enc) for enc in encodings]
            # sort by length of the decoded string, not of the (x, enc) pair
            output = [(len(x), (x, enc)) for (x, enc) in output]
            output.sort()
            x, enc = output[-1][1]
            return x, 1, enc
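
Typical use looks something like this (store_in_db is hypothetical):

    s = "caf\xe9 \x80"           # bytes of unknown provenance
    x, lossy, enc = decode_heuristically(s)
    if lossy:
        print "had to drop characters; best guess was", enc
    store_in_db(x.encode("utf-8"))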

Note that I deal almost exclusively with ASCII, but the Microsoft encodings
and the occasional bit of Latin-1 which creep in give me problems.
Everything which goes into my database gets utf-8-encoded.  Most of my
inputs come from web form submissions, which frequently seem to have no
encoding information or specify the encoding incorrectly.

The above function resulted from a huge amount of agony on my part, coupled
with a fair amount of feedback from Martin von Löwis.  It could probably
still stand some refinement (I don't recall if I ever incorporated Martin's
last comments on the topic).

    Sean> Or do I need to find a way to determine beforehand what encoding
    Sean> they are using when they are read in?

If you can find that out reliably, it's much better than guessing like I do
above.  The closer to the source of the data you look for encoding
information, the better your chances of finding something, but you still
have to be prepared to punt.  I still have cp1252 stuff creeping into my
database on occasion, so each night a cron job dumps the database to flat
files and checks for stuff that snuck through.  (I've obviously missed
heuristically decoding some inputs, but the system I'm maintaining is about
eight years old and only gets a small amount of attention these days.)
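
FWIW, the nightly check boils down to something like this (sketch - the
dump file pattern is made up):

    import glob

    def find_bad_lines(pattern="/path/to/dumps/*.txt"):
        # flag any dumped line which doesn't decode cleanly as utf-8
        bad = []
        for fname in glob.glob(pattern):
            f = open(fname, "rb")
            lineno = 0
            for line in f:
                lineno = lineno + 1
                try:
                    unicode(line, "utf-8")
                except UnicodeError:
                    bad.append((fname, lineno))
            f.close()
        return bad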

Skip




