PEP 249 Compliant error handling
MRAB
python at mrabarnett.plus.com
Tue Oct 17 16:02:29 EDT 2017
On 2017-10-17 20:25, Israel Brewster wrote:
>
>> On Oct 17, 2017, at 10:35 AM, MRAB <python at mrabarnett.plus.com> wrote:
>>
>> On 2017-10-17 18:26, Israel Brewster wrote:
>>> I have written and maintain a PEP 249 compliant (hopefully) DB API
>>> for the 4D database, and I've run into a situation where corrupted
>>> string data from the database can cause the module to error out.
>>> Specifically, when decoding the string, I get a "UnicodeDecodeError:
>>> 'utf-16-le' codec can't decode bytes in position 86-87: illegal
>>> UTF-16 surrogate" error. This makes sense, given that the string
>>> data got corrupted somehow, but the question is "what is the proper
>>> way to deal with this in the module?" Should I just throw an error
>>> on bad data? Or would it be better to set the errors parameter to
>>> something like "replace"? The former feels a bit more "proper" to me
>>> (there's an error here, so we throw an error), but leaves the end
>>> user dead in the water, with no way to retrieve *any* of the data
>>> (from that row at least, and perhaps any rows after it as well). The
>>> latter option sort of feels like sweeping the problem under the rug,
>>> but does at least leave an error character in the string to let
>>> them know there was an error, and will allow retrieval of any
>>> good data.
>>> Of course, if this was in my own code I could decide on a
>>> case-by-case basis what the proper action is, but since this is a
>>> module that has to work in any situation, it's a bit more complicated.
>> If a particular text field is corrupted, then raising
>> UnicodeDecodeError when trying to get the contents of that field as a
>> Unicode string seems reasonable to me.
>>
>> Is there a way to get the contents as a bytestring, or to get the
>> contents with a different errors parameter, so that the user has the
>> means to fix it (if it's fixable)?
>
> That's certainly a possibility, if that behavior conforms to the DB
> API "standards". My concern on this front is that in my experience
> working with other PEP 249 modules (specifically psycopg2), I'm pretty
> sure that columns designated as type VARCHAR or TEXT are returned as
> strings (unicode in python 2, although that may have been a setting I
> used), not bytes. The other complication here is that the 4D database
> doesn't use the UTF-8 encoding typically found, but rather UTF-16LE,
> and I don't know how well this is documented. So not only is the bytes
> representation completely unintelligible for human consumption, I'm
> not sure the average end-user would know what decoding to use.
>
> In the end though, the main thing in my mind is to maintain
> "standards" compatibility - I don't want to be returning bytes if all
> other DB API modules return strings, or vice versa for that matter.
> There may be some flexibility there, but as much as possible I want to
> conform to the majority/standard/whatever.
>
The average end-user might not know which encoding is being used, but
providing a way to read the underlying bytes will give a more
experienced user the means to investigate and possibly fix it: get the
bytes, figure out what the string should be, update the field with the
correctly decoded string using normal DB instructions.
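
As a rough sketch of that idea (the helper name decode_field and the sample
bytes are made up for illustration; nothing here is part of PEP 249 or of
the actual 4D module), the driver could raise by default and let the caller
opt in to the raw bytes or to a more lenient errors setting:

```python
# Illustrative sketch only: decode_field and the sample bytes are hypothetical,
# not part of PEP 249 or of any real 4D driver.

ENCODING = "utf-16-le"  # the encoding the thread says 4D uses for text


def decode_field(raw, errors="strict"):
    """Decode a raw text field, letting the caller choose the error policy."""
    return raw.decode(ENCODING, errors=errors)


# Build a corrupted value: a lone high surrogate (U+D800) between two good runs.
raw_value = "good".encode(ENCODING) + b"\x00\xd8" + "data".encode(ENCODING)

try:
    text = decode_field(raw_value)  # default policy: corruption raises
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    # exc.start and exc.end locate the offending bytes within raw_value,
    # which is what an experienced user would want to inspect.
    bad_bytes = raw_value[exc.start:exc.end]
    # Opt-in fallback: keep the good data and mark the damage with U+FFFD.
    text = decode_field(raw_value, errors="replace")
```

Once the user has worked out what the string should have been, an ordinary
UPDATE with the corrected value would repair the row.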