Python 3.x stuffing utf-8 into SQLite db

Chris Angelico rosuav at gmail.com
Mon Feb 9 12:40:14 EST 2015


On Tue, Feb 10, 2015 at 4:30 AM, Skip Montanaro
<skip.montanaro at gmail.com> wrote:
> On Sun, Feb 8, 2015 at 9:58 PM, Chris Angelico <rosuav at gmail.com> wrote:
>> Those three characters are the CP-1252 decode of the bytes for U+2019
>> in UTF-8 (E2 80 99). Not sure if that helps any, but given that it was
>> an XLSX file, Windows codepages are reasonably likely to show up.
>
> Thanks, Chris. Are you telling me I should have defined the input file
> encoding for my CSV file as CP-1252, or that something got hosed on
> the export from XLSX to CSV? Or something else?
>
> Skip

Well, I'm not entirely sure. If your input file is actually CP-1252
and you try to decode it as UTF-8, you'll almost certainly get an
error (unless of course it's all ASCII, but you know it isn't in this
case). Also, I'd say chardet will usually be correct. But it might be
worth locating one of those apostrophes in the file and looking at the
actual bytes representing it... because what you may have is a crazy
double-encoded mess. If you take a document with U+2019 in it, encode
it as UTF-8, then decode it as CP-1252, then re-encode as UTF-8, you
could get that. (I think. Haven't actually checked.) If someone gave
UTF-8 bytes to a program that doesn't know the difference between
bytes and characters and assumes CP-1252, then you might well get
something like this. Hence, having a look at the exact bytes in the
.CSV file may help.

Easiest might be to pull it up in a hex viewer (I use 'hd' on my
Debian systems), and grep for the critical line. Otherwise, use Python
and try to pull out a line from the byte stream.
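Something along these lines would do it (the helper name and the
needle default are my own invention; point it at your CSV):

```python
def find_suspect_lines(path, needle=b"\xe2\x80\x99"):
    """Yield (line number, raw bytes) for lines containing the needle.

    The default needle is the UTF-8 encoding of U+2019. The file is
    opened in binary mode, so no decoding can get in the way and you
    see exactly what's on disk.
    """
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, 1):
            if needle in raw:
                yield lineno, raw

# Usage (hypothetical file name):
#     for lineno, raw in find_suspect_lines("export.csv"):
#         print(lineno, raw)
```

Swap the needle for b"\xc3\xa2\xe2\x82\xac\xe2\x84\xa2" to hunt for
the double-encoded version instead.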

Good luck. You may need it.

ChrisA
