String character encoding when converting data from one type/format to another
Jacob Kruger
jacob at blindza.co.za
Wed Jan 7 08:41:10 EST 2015
Thanks.
Yes, sorry didn't mention 2.7, and, unfortunately in this sense, all of this
will be running on windows machines.
Stay well
Jacob Kruger
Blind Biker
Skype: BlindZA
"Roger Wilco wants to welcome you...to the space janitor's closet..."
----- Original Message -----
From: "Dave Angel" <davea at davea.name>
To: <python-list at python.org>
Sent: Wednesday, January 07, 2015 2:22 PM
Subject: Re: String character encoding when converting data from one
type/format to another
> On 01/07/2015 06:04 AM, Jacob Kruger wrote:
>> I'm busy using something like pyodbc to pull data out of MS access .mdb
>> files, and then generate .sql script files to execute
>
> against MySQL databases using MySQLdb module, but, issue is forms of
> characters in string values that don't fit inside
>
> the 0-127 range - current one seems to be something like \xa3, and if I
> pass it through ord() function,
>
> it comes out as character number 163.
>
> First question, of course is what version of Python. Clearly, you're not
> using Python 3.x, so I'll assume 2.7. But you really should specify it in
> your query.
>
> Next question is what OS you're using. You're reading .mdb files, which
> are most likely created in Windows, but that doesn't guarantee you're
> actually using Windows to do this conversion.
>
>
>>
>> Now issue is, yes, could just run through the hundreds of thousands of
>> characters in these resulting strings, and strip out any that are not
>> within the basic 0-127 range, but, that could result in corrupting data -
>> think so anyway.
>>
>> Anyway, issue is, for example, if I try something like
>> str('\xa3').encode('utf-8') or str('\xa3').encode('ascii'), or
>> str('\xa3').encode('latin7') - that last one is actually our preferred
>> encoding for the MySQL database - they all just tell me they can't work
>> with a character out of range.
>>
>
> That's because your input data isn't ASCII. So before you encode it, you
> have to decode it. Any idea what encoding it's already in? Maybe it's in
> latin1, which permits all 256 values. Or utf-8, which permits a few
> hundred thousand values, but uses multiple bytes for any of those over
> 127. Or any of hundreds of other encodings.
>
> Does an .mdb file have a field specifying what encoding was used? Or do
> you have some other external knowledge?
>
> If you don't know what encoding it's currently in, you'll have to guess,
> and the guess you're using so far is ASCII, which you know is false.
>
> As for the encoding you should actually use in the database, that almost
> certainly ought to be utf-8, which supports far more international
> characters than latin1. And make sure the database has a way to tell the
> future user what encoding you picked.
>
>> Any thoughts on a sort of generic method/means to handle any/all
>> characters that might be out of range when having pulled them out of
>> something like these MS access databases?
>
> The only invalid characters are those which aren't valid in the encoding
> used. Those can probably be safely converted to "?" or something similar.
>
>>
>> Another side note is for binary values that might store binary values, I
>> use something like the following to generate hex-based strings that work
>> alright when then inserting said same binary values into longblob fields,
>> but, don't think this would really help for what are really just most
>> likely badly chosen copy/pasted strings from documents, with strange
>> encoding, or something:
>> #sample code line for binary encoding into string output
>> s_values += "0x" + str(l_data[J][I]).encode("hex").replace("\\", "\\\\")
>> + ", "
>>
>
> Best to not pretend they're text at all. But if your db doesn't support
> binary blobs, then use an encoding which supports all 256 values
> unambiguously, while producing printable characters. Like uuencod, using
> module uu
>
> You might also look into mime, where you store the encoding of the data
> with the data. See for example mimetypes.
>
>
>
> --
> DaveA
> --
> https://mail.python.org/mailman/listinfo/python-list
>
More information about the Python-list
mailing list