using DictReader() with .decode('utf-8', 'ignore')

Vincent Davis vincent at vincentdavis.net
Tue Apr 14 16:51:53 CEST 2015


On Tue, Apr 14, 2015 at 7:48 AM, Steven D'Aprano <
steve+comp.lang.python at pearwood.info> wrote:

> with open(dfile, 'rb') as f:
>     for line in f:
>         try:
>             s = line.decode('utf-8', 'strict')
>         except UnicodeDecodeError as err:
>             print(err)
>
> If you need help deciphering the errors, please copy and paste them here
> and we'll see what we can do.


Below are the errors. I knew about these, and I think the correct encoding
is windows-1252. I will paste some code and output at the end of this email
that prints the offending column in each line. These characters are very
likely errors, so I want to remove them. I am reading this csv into a django
sqlite3 db. What is strange to me is that using
"with open(dfile, 'r', encoding='utf-8', errors='ignore', newline='')"
does not seem to remove them; it seems to save them to the db intact, which
I don't understand.


'utf-8' codec can't decode byte 0xa6 in position 368: invalid start byte
'utf-8' codec can't decode byte 0xac in position 223: invalid start byte
'utf-8' codec can't decode byte 0xa6 in position 1203: invalid start byte
'utf-8' codec can't decode byte 0xa2 in position 44: invalid start byte
'utf-8' codec can't decode byte 0xac in position 396: invalid start byte
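
For reference, a minimal sketch of how the three bytes reported above behave
under each codec. The sample byte string is made up for illustration; only the
byte values 0xa6, 0xac and 0xa2 come from the errors.

```python
# Hypothetical sample line; only the byte values come from the errors above.
raw = b"WELLS FARGO &\xa2 COMPANY"

# windows-1252 maps each of these bytes to a printable character.
for byte in (0xA6, 0xAC, 0xA2):
    print(hex(byte), repr(bytes([byte]).decode("windows-1252")))

as_cp1252 = raw.decode("windows-1252")    # keeps the stray character
as_utf8 = raw.decode("utf-8", "ignore")   # drops the invalid byte

print(repr(as_cp1252))
print(repr(as_utf8))
```

If errors='ignore' is really in effect, the second form is what should reach
the db.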

import chardet
with open("DATA/ATSDTA_ATSP600.csv", 'rb') as f:
    for line in f:
        code = chardet.detect(line)
        #if code == {'confidence': 0.5, 'encoding': 'windows-1252'}:
        if code != {'encoding': 'ascii', 'confidence': 1.0}:
            print(code)
        win = line.decode('windows-1252').split(',') #windows-1252
        norm = line.decode('utf-8', 'ignore').split(',')
        ascii = line.decode('ascii', "ignore").split(',')
        ascii2 = line.decode('ISO-8859-1').split(',')

        # Print only the columns where the two decodings disagree.
        for w, n, a, a2 in zip(win, norm, ascii, ascii2):
            if w != n:
                print(w, n, a, a2)
                print(win[0])

## Output

{'encoding': 'windows-1252', 'confidence': 0.5}
"¦   " "   " "   " "¦   "
"040543"
{'encoding': 'windows-1252', 'confidence': 0.5}
"LEASE GREGPRU D ¬ETERSPM                 " "LEASE GREGPRU D ETERSPM                 " "LEASE GREGPRU D ETERSPM                 " "LEASE GREGPRU D ¬ETERSPM                 "
"979643"
{'encoding': 'windows-1252', 'confidence': 0.5}
"¦   " "   " "   " "¦   "
"986979"
{'encoding': 'windows-1252', 'confidence': 0.5}
"WELLS FARGO &¢ COMPANY                   " "WELLS FARGO & COMPANY                   " "WELLS FARGO & COMPANY                   " "WELLS FARGO &¢ COMPANY                   "
"994946"
{'encoding': 'windows-1252', 'confidence': 0.5}
OSSOSSO¬¬O         " OSSOSSOO         " OSSOSSOO         " OSSOSSO¬¬O         "
"996535"
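
Since the goal is to drop these characters before loading the rows into the
db, here is a rough sketch of feeding csv.DictReader text decoded with
errors='ignore'. The column names and sample rows are invented for
illustration; with the real file, the io.BytesIO object would be replaced by
the file opened in 'rb' mode.

```python
import csv
import io

# Invented stand-in for the real file contents (header and rows are assumptions).
raw = (
    b'"ID","NAME"\r\n'
    b'"994946","WELLS FARGO &\xa2 COMPANY"\r\n'
    b'"040543","\xa6   "\r\n'
)

# TextIOWrapper decodes incrementally, so it works the same on a binary file.
text = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8",
                        errors="ignore", newline="")

rows = list(csv.DictReader(text))
for row in rows:
    print(row)

# No non-ASCII character should survive the lossy decode of this sample.
assert all(ord(ch) < 128 for row in rows for value in row.values() for ch in value)
```

One way to check what actually reaches the db is to print repr() of a suspect
field right before saving it.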



Vincent Davis
720-301-3003