String character encoding when converting data from one type/format to another

Jacob Kruger jacob at blindza.co.za
Wed Jan 7 14:41:10 CET 2015


Thanks.

Yes, sorry didn't mention 2.7, and, unfortunately in this sense, all of this 
will be running on windows machines.

Stay well

Jacob Kruger
Blind Biker
Skype: BlindZA
"Roger Wilco wants to welcome you...to the space janitor's closet..."

----- Original Message ----- 
From: "Dave Angel" <davea at davea.name>
To: <python-list at python.org>
Sent: Wednesday, January 07, 2015 2:22 PM
Subject: Re: String character encoding when converting data from one 
type/format to another


> On 01/07/2015 06:04 AM, Jacob Kruger wrote:
>> I'm busy using something like pyodbc to pull data out of MS access .mdb 
>> files, and then generate .sql script files to execute
>
>  against MySQL databases using MySQLdb module, but, issue is forms of 
> characters in string values that don't fit inside
>
>  the 0-127 range - current one seems to be something like \xa3, and if I 
> pass it through ord() function,
>
>  it comes out as character number 163.
>
> First question, of course is what version of Python.  Clearly, you're not 
> using Python 3.x, so I'll assume 2.7.  But you really should specify it in 
> your query.
>
> Next question is what OS you're using.  You're reading .mdb files, which 
> are most likely created in Windows, but that doesn't guarantee you're 
> actually using Windows to do this conversion.
>
>
>>
>> Now issue is, yes, could just run through the hundreds of thousands of 
>> characters in these resulting strings, and strip out any that are not 
>> within the basic 0-127 range, but, that could result in corrupting data - 
>> think so anyway.
>>
>> Anyway, issue is, for example, if I try something like 
>> str('\xa3').encode('utf-8') or str('\xa3').encode('ascii'), or 
>> str('\xa3').encode('latin7') - that last one is actually our preferred 
>> encoding for the MySQL database - they all just tell me they can't work 
>> with a character out of range.
>>
>
> That's because your input data isn't ASCII.  So before you encode it, you 
> have to decode it.  Any idea what encoding it's already in?  Maybe it's in 
> latin1, which permits all 256 values.  Or utf-8, which permits a few 
> hundred thousand values, but uses multiple bytes for any of those over 
> 127.  Or any of hundreds of other encodings.
>
> Does an .mdb file have a field specifying what encoding was used?  Or do 
> you have some other external knowledge?
>
> If you don't know what encoding it's currently in, you'll have to guess, 
> and the guess you're using so far is ASCII, which you know is false.
>
> As for the encoding you should actually use in the database, that almost 
> certainly ought to be utf-8, which supports far more international 
> characters than latin1.  And make sure the database has a way to tell the 
> future user what encoding you picked.
>
>> Any thoughts on a sort of generic method/means to handle any/all 
>> characters that might be out of range when having pulled them out of 
>> something like these MS access databases?
>
> The only invalid characters are those which aren't valid in the encoding 
> used.  Those can probably be safely converted to "?" or something similar.
>
>>
>> Another side note is for binary values that might store binary values, I 
>> use something like the following to generate hex-based strings that work 
>> alright when then inserting said same binary values into longblob fields, 
>> but, don't think this would really help for what are really just most 
>> likely badly chosen copy/pasted strings from documents, with strange 
>> encoding, or something:
>> #sample code line for binary encoding into string output
>> s_values += "0x" + str(l_data[J][I]).encode("hex").replace("\\", "\\\\") 
>> + ", "
>>
>
> Best to not pretend they're text at all.  But if your db doesn't support 
> binary blobs, then use an encoding which supports all 256 values 
> unambiguously, while producing printable characters.  Like uuencod, using 
> module uu
>
> You might also look into mime, where you store the encoding of the data 
> with the data.  See for example mimetypes.
>
>
>
> -- 
> DaveA
> -- 
> https://mail.python.org/mailman/listinfo/python-list
> 




More information about the Python-list mailing list