Python 3.x stuffing utf-8 into SQLite db
wxjmfauth at gmail.com
Tue Feb 10 03:23:33 EST 2015
On Tuesday, 10 February 2015 at 01:37:15 UTC+1, Skip Montanaro wrote:
> On Mon, Feb 9, 2015 at 2:38 PM, Skip Montanaro <skip.mo... at gmail.com> wrote:
> On Mon, Feb 9, 2015 at 2:05 PM, Zachary Ware <zachary.w... at gmail.com> wrote:
> > If all else fails, you can try ftfy to fix things:
> > http://ftfy.readthedocs.org/en/latest/
>
> Thanks for the pointer. I would prefer to not hand-mangle this stuff
> in case I get another database dump from my USMS friends. Something
> like ftfy should help things "just work".
>
> And indeed it did. Thanks Zachary.
>
%%%%%%
ftfy: a mountain of absurdities and, on top of that, rather buggy.
Everything works fine if it is done correctly. There is
nothing to fix. I have the feeling that you are destroying a
correct data file and then trying to repair what you have
destroyed.
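
To make the "destroy, then repair" point concrete, here is a minimal sketch.
It assumes the damage comes from decoding correct UTF-8 bytes with the wrong
codec (cp1252 is picked here only as an illustration; the thread does not say
which codec was involved), and that the apostrophe is U+2019.

```python
# Text containing U+2019 (RIGHT SINGLE QUOTATION MARK), assumed for illustration.
original = "Patrick\u2019s Day"

# Correct UTF-8 bytes, as a well-formed data file would contain them.
utf8_bytes = original.encode("utf-8")

# Assumed mistake: the UTF-8 bytes are decoded with the wrong codec (cp1252).
damaged = utf8_bytes.decode("cp1252")

# Undoing the mistake by reversing exactly the same wrong step.
repaired = damaged.encode("cp1252").decode("utf-8")

print(ascii(damaged))        # the mojibake, shown with escapes
print(repaired == original)  # the round trip restores the text
```

If the file is decoded as UTF-8 in the first place, neither the damage nor
the repair step is needed.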
Basically the same experiment as Matthew Ruffalo's:
Office suite --> csv file saved as pd.csv.
From my GUI interactive interpreter (py32):
>>> with open('pd.csv', encoding='utf-8') as f:
...     r = f.read()
...
>>> print(r)
"Patrick’s Day A1","Patrick’s Day B1","Patrick’s Day C1"
"Patrick’s Day A2","Patrick’s Day C2","Patrick’s Day C2"
>>>
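
The same read can be tried as a self-contained script. This is only a sketch:
the file is created in a temporary directory, the sample row is shortened, and
the apostrophe is assumed to be U+2019, the character a word processor
typically inserts.

```python
import os
import tempfile

# Sample row; the typographic apostrophe U+2019 is an assumption.
text = '"Patrick\u2019s Day A1","Patrick\u2019s Day B1"\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pd.csv")  # stand-in for the exported file

    # Write the file as UTF-8, as the office suite export is assumed to do.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)

    # Read it back with the encoding stated explicitly, never the locale default.
    with open(path, encoding="utf-8", newline="") as f:
        r = f.read()

print(r == text)  # the text survives the round trip unchanged
```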
Now what may happen is that the terminal (the host system)
cannot display all these characters correctly (Windows, Russian *x, ...).
In that case, one has to code for it correctly.
Still with the same GUI interpreter (sethostencoding belongs to that
interpreter, not to standard Python):
>>> sys.stdout.sethostencoding('cp850')
>>> outenc = sys.stdout.encoding
>>> print(r.encode(outenc, 'replace').decode(outenc))
"Patrick?s Day A1","Patrick?s Day B1","Patrick?s Day C1"
"Patrick?s Day A2","Patrick?s Day C2","Patrick?s Day C2"
>>> sys.stdout.sethostencoding('iso-8859-5')
>>> outenc = sys.stdout.encoding
>>> print(r.encode(outenc, 'replace').decode(outenc))
"Patrick?s Day A1","Patrick?s Day B1","Patrick?s Day C1"
"Patrick?s Day A2","Patrick?s Day C2","Patrick?s Day C2"
This is exactly what can be observed in a web browser.
Just for fun, and in fact a no-op:
>>> sys.stdout.sethostencoding('utf-32-le')
>>> outenc = sys.stdout.encoding
>>> print(r.encode(outenc, 'replace').decode(outenc))
"Patrick’s Day A1","Patrick’s Day B1","Patrick’s Day C1"
"Patrick’s Day A2","Patrick’s Day C2","Patrick’s Day C2"