String character encoding when converting data from one type/format to another
rosuav at gmail.com
Wed Jan 7 13:20:16 CET 2015
On Wed, Jan 7, 2015 at 11:02 PM, Ned Batchelder <ned at nedbatchelder.com> wrote:
>> Any thoughts on a sort of generic method/means to handle any/all
>> characters that might be out of range when having pulled them out of
>> something like these MS access databases?
> The best thing is to know what encoding was used to produce these byte
> values. Then you can manipulate them as Unicode if you need to. The second
> best thing is to simply pass them through as bytes.
If you can't know for sure, you could hazard a guess. There's a good
chance that an eight-bit encoding from a Microsoft product is CP-1252.
In fact, when I interoperate with Unicode-unaware Windows programs, I
usually attempt a UTF-8 decode, and if that fails, I simply assume
CP-1252; this generally gives correct results for data coming from
US-English Windows users.
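As a rough sketch of that fallback (the function name decode_guess is
made up for illustration, and errors="replace" is my own assumption so
that the handful of bytes CP-1252 leaves undefined don't raise):

```python
def decode_guess(data: bytes) -> str:
    """Try UTF-8 first; if that fails, assume CP-1252 (hypothetical helper)."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: replace bytes that CP-1252 leaves undefined with U+FFFD
        # instead of raising a second error.
        return data.decode("cp1252", errors="replace")


print(decode_guess(b"\xc2\xa3100"))  # valid UTF-8, decoded as such
print(decode_guess(b"\xa3100"))      # not valid UTF-8, falls back to CP-1252
```

Trying UTF-8 first is reasonably safe because non-ASCII CP-1252 text
rarely happens to be valid UTF-8, so a successful UTF-8 decode is a
strong hint that the data really was UTF-8.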
Jacob, have a look at your data. Contextually, would the '\xa3' be
likely to be a pound sign, £? Would '\x85' make sense as an ellipsis?
Would \x91, \x92, \x93, and \x94 seem to be used for quote marks? If
so, CP-1252 would be the encoding to use.
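One way to eyeball this is to print what each suspect byte becomes
under CP-1252. A small sketch (describe_cp1252 is a hypothetical helper
name; errors="replace" is assumed so that bytes CP-1252 does not define
show up as U+FFFD rather than raising):

```python
import unicodedata


def describe_cp1252(byte_values):
    """Return (byte, character, Unicode name) for each byte decoded as CP-1252."""
    rows = []
    for b in byte_values:
        # Undefined bytes decode to U+FFFD because of errors="replace".
        ch = bytes([b]).decode("cp1252", errors="replace")
        rows.append((b, ch, unicodedata.name(ch, "UNKNOWN")))
    return rows


for b, ch, name in describe_cp1252([0xA3, 0x85, 0x90, 0x91, 0x92, 0x93, 0x94]):
    print(f"\\x{b:02x} -> {ch!r} {name}")
```

If the names that come out (pound sign, ellipsis, quotation marks)
match how the characters are used in the surrounding text, CP-1252 is
a plausible guess; a REPLACEMENT CHARACTER means that byte has no
meaning in Python's CP-1252 codec.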