String character encoding when converting data from one type/format to another
jacob at blindza.co.za
Wed Jan 7 14:32:00 CET 2015
Yes, using python 2.7, and all you said makes sense, so will check out the
talk, and the byte'ing, etc. (yes, bad joke, I know)
Issue with knowing encoding could just be that am pretty sure at least some
of the data capture is done via copy/paste from one MS app to another, which
could possibly result in a whole bunch of different character sets, etc.
being copied across, so it comes down to that while can't control sources of
data, need to manipulate/work with it to make it useful on our side now.
"Roger Wilco wants to welcome you...to the space janitor's closet..."
----- Original Message -----
From: "Ned Batchelder" <ned at nedbatchelder.com>
To: <python-list at python.org>
Sent: Wednesday, January 07, 2015 2:02 PM
Subject: Re: String character encoding when converting data from one
type/format to another
> On 1/7/15 6:04 AM, Jacob Kruger wrote:
>> I'm busy using something like pyodbc to pull data out of MS access .mdb
>> files, and then generate .sql script files to execute against MySQL
>> databases using MySQLdb module, but, issue is forms of characters in
>> string values that don't fit inside the 0-127 range - current one seems
>> to be something like \xa3, and if I pass it through ord() function, it
>> comes out as character number 163.
>> Now issue is, yes, could just run through the hundreds of thousands of
>> characters in these resulting strings, and strip out any that are not
>> within the basic 0-127 range, but, that could result in corrupting data
>> - think so anyway.
> That will definitely corrupt your data, since you will be discarding data.
>> Anyway, issue is, for example, if I try something like
>> str('\xa3').encode('utf-8') or str('\xa3').encode('ascii'), or
>> str('\xa3').encode('latin7') - that last one is actually our preferred
>> encoding for the MySQL database - they all just tell me they can't work
>> with a character out of range.
> Are you using Python 2 or Python 3? This is one area where the two are
> very different. I suspect you are on Python 2, in which case these all
> fail the same way because you are calling encode on a bytestring. You
> can't encode a bytestring, you can only encode a Unicode string, so encode
> is helpfully trying to decode your bytestring first, using the default
> encoding (ascii), and '\xa3' is not an ascii character.
> If that was confusing, this talk covers these fundamentals:
> http://bit.ly/unipain .
>> Any thoughts on a sort of generic method/means to handle any/all
>> characters that might be out of range when having pulled them out of
>> something like these MS access databases?
> The best thing is to know what encoding was used to produce these byte
> values. Then you can manipulate them as Unicode if you need to. The
> second best thing is to simply pass them through as bytes.
>> Another side note is for binary values that might store binary values, I
>> use something like the following to generate hex-based strings that work
>> alright when then inserting said same binary values into longblob
>> fields, but, don't think this would really help for what are really just
>> most likely badly chosen copy/pasted strings from documents, with
>> strange encoding, or something:
>> #sample code line for binary encoding into string output
>> s_values += "0x" + str(l_data[J][I]).encode("hex").replace("\\", "\\\\")
>> + ", "
>> Jacob Kruger
>> Blind Biker
>> Skype: BlindZA
>> "Roger Wilco wants to welcome you...to the space janitor's closet..."
> Ned Batchelder, http://nedbatchelder.com
More information about the Python-list