<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Apr 14, 2015 at 7:48 AM, Steven D'Aprano <span dir="ltr">&lt;<a href="mailto:steve+comp.lang.python@pearwood.info" target="_blank">steve+comp.lang.python@pearwood.info</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">with open(dfile, 'rb') as f:<br>
    for line in f:<br>
        try:<br>
            s = line.decode('utf-8', 'strict')<br>
        except UnicodeDecodeError as err:<br>
            print(err)<br>
<br>
If you need help deciphering the errors, please copy and paste them here and<br>
we'll see what we can do.</blockquote></div><br>Below are the errors. I knew about these, and I think the correct encoding is windows-1252. At the end of this email I have pasted some code, with its output, that prints the offending column in each line. These are very likely errors, so I want to remove them. I am reading this CSV into a Django sqlite3 db. What is strange to me is that using "with open(dfile, 'r', encoding='utf-8', errors='ignore', newline='')" does not seem to remove them; they seem to be saved to the db correctly, which I don't understand.<br><br>
'utf-8' codec can't decode byte 0xa6 in position 368: invalid start byte<br>
'utf-8' codec can't decode byte 0xac in position 223: invalid start byte<br>
'utf-8' codec can't decode byte 0xa6 in position 1203: invalid start byte<br>
'utf-8' codec can't decode byte 0xa2 in position 44: invalid start byte<br>
'utf-8' codec can't decode byte 0xac in position 396: invalid start byte</div>
<div class="gmail_extra"><br></div>
<div class="gmail_extra"><div class="gmail_extra">import chardet</div>
<div class="gmail_extra">with open("DATA/ATSDTA_ATSP600.csv", 'rb') as f:</div>
<div class="gmail_extra">    for line in f:</div>
<div class="gmail_extra">        code = chardet.detect(line)</div>
<div class="gmail_extra">        #if code == {'confidence': 0.5, 'encoding': 'windows-1252'}:</div>
<div class="gmail_extra">        if code != {'encoding': 'ascii', 'confidence': 1.0}:</div>
<div class="gmail_extra">            print(code)</div>
<div class="gmail_extra">            win = line.decode('windows-1252').split(',') #windows-1252</div>
<div class="gmail_extra">            norm = line.decode('utf-8', 'ignore').split(',')</div>
<div class="gmail_extra">            ascii = line.decode('ascii', "ignore").split(',')</div>
<div class="gmail_extra">            ascii2 = line.decode('ISO-8859-1').split(',')</div>
<div class="gmail_extra"> </div>
<div class="gmail_extra">            for w, n, a, a2 in zip(win, norm, ascii, ascii2):</div>
<div class="gmail_extra">                if w != n:</div>
<div class="gmail_extra">                    print(w, n, a, a2)</div>
<div class="gmail_extra">            print(win[0])</div>
<div class="gmail_extra"><br></div>
<div class="gmail_extra"><div class="gmail_default" style="font-family:verdana,sans-serif;font-size:small">## Output</div><br></div>
<div class="gmail_extra"><pre style="overflow:auto;font-size:14px;padding:0px;margin-top:0px;margin-bottom:0px;line-height:17.0000591278076px;word-break:break-all;word-wrap:break-word;color:rgb(0,0,0);border:0px;border-radius:0px;white-space:pre-wrap;vertical-align:baseline">{'encoding': 'windows-1252', 'confidence': 0.5}
"¦ " " " " " "¦ "
"040543"
{'encoding': 'windows-1252', 'confidence': 0.5}
"LEASE GREGPRU D ¬ETERSPM " "LEASE GREGPRU D ETERSPM " "LEASE GREGPRU D ETERSPM " "LEASE GREGPRU D ¬ETERSPM "
"979643"
{'encoding': 'windows-1252', 'confidence': 0.5}
"¦ " " " " " "¦ "
"986979"
{'encoding': 'windows-1252', 'confidence': 0.5}
"WELLS FARGO &¢ COMPANY " "WELLS FARGO & COMPANY " "WELLS FARGO & COMPANY " "WELLS FARGO &¢ COMPANY "
"994946"
{'encoding': 'windows-1252', 'confidence': 0.5}
OSSOSSO¬¬O " OSSOSSOO " OSSOSSOO " OSSOSSO¬¬O "
"996535"</pre></div><br><br>Vincent Davis<br>720-301-3003
</div></div>
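<div dir="ltr">As a small check of what each decode call in the script does with one of the reported bytes (0xa2), here is a minimal sketch. The sample byte string is made up for illustration and is not copied from the CSV file; only the byte value comes from the error messages.</div>

```python
# Hypothetical sample line; only the byte value 0xa2 is taken from the
# error messages in this thread, the surrounding text is illustrative.
raw = b'"WELLS FARGO &\xa2 COMPANY "'

# Single-byte encodings map every byte to some character, so 0xa2 survives
# and becomes the cent sign.
win = raw.decode('windows-1252')
latin = raw.decode('ISO-8859-1')

# With errors='ignore', bytes that are invalid for the codec are dropped.
utf8_ignored = raw.decode('utf-8', 'ignore')
ascii_ignored = raw.decode('ascii', 'ignore')

# Strict decoding raises instead of dropping the byte.
try:
    raw.decode('utf-8', 'strict')
    strict_failed = False
except UnicodeDecodeError as err:
    strict_failed = True
    print(err)

print(repr(win))
print(repr(latin))
print(repr(utf8_ignored))
print(repr(ascii_ignored))
```

<div dir="ltr">Under these assumptions the two single-byte decodes keep the stray character, while the two decodes with 'ignore' drop it, which matches the four columns in the output above.</div>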