using DictReader() with .decode('utf-8', 'ignore')
Vincent Davis
vincent at vincentdavis.net
Tue Apr 14 10:51:53 EDT 2015
On Tue, Apr 14, 2015 at 7:48 AM, Steven D'Aprano <
steve+comp.lang.python at pearwood.info> wrote:
> with open(dfile, 'rb') as f:
>     for line in f:
>         try:
>             s = line.decode('utf-8', 'strict')
>         except UnicodeDecodeError as err:
>             print(err)
>
> If you need help deciphering the errors, please copy and paste them here
> and we'll see what we can do.
Below are the errors. I knew about these, and I think the correct encoding
is windows-1252. At the end of this email I will paste some code and output
that prints the offending column in each line. These characters are very
likely errors, so I want to remove them. I am reading this csv into a
django sqlite3 db.
What is strange to me is that using

    with open(dfile, 'r', encoding='utf-8', errors='ignore', newline='')

does not seem to remove them; they seem to be saved to the db intact,
which I don't understand.
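For comparison, here is a minimal sketch of what errors='ignore' would be expected to do when the decoded text is handed to csv.DictReader. The sample bytes, the column names, and the read_rows helper are made up for illustration; an in-memory io.BytesIO stands in for the real file.

```python
import csv
import io

# Hypothetical sample: the second field contains the byte 0xa6,
# which is not valid UTF-8 on its own.
raw = b'id,name\r\n"040543","\xa6  "\r\n'


def read_rows(data, errors):
    """Decode bytes as UTF-8 with the given error handler, then parse as CSV."""
    text = io.TextIOWrapper(io.BytesIO(data), encoding='utf-8',
                            errors=errors, newline='')
    return list(csv.DictReader(text))


ignored = read_rows(raw, 'ignore')    # invalid bytes are dropped silently
replaced = read_rows(raw, 'replace')  # invalid bytes become U+FFFD

print(repr(ignored[0]['name']))
print(repr(replaced[0]['name']))
```

If the odd characters still reach the database, it may be worth checking whether the rows are really coming through a file object opened this way.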
'utf-8' codec can't decode byte 0xa6 in position 368: invalid start byte
'utf-8' codec can't decode byte 0xac in position 223: invalid start byte
'utf-8' codec can't decode byte 0xa6 in position 1203: invalid start byte
'utf-8' codec can't decode byte 0xa2 in position 44: invalid start byte
'utf-8' codec can't decode byte 0xac in position 396: invalid start byte
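As a rough illustration of why these bytes fail as UTF-8 yet decode under the single-byte encodings, the sketch below decodes the three byte values from the messages above in isolation. The variable names are only for the example.

```python
# Byte values taken from the error messages above.
bad_bytes = (b'\xa6', b'\xac', b'\xa2')

as_cp1252 = {}
as_utf8_ignore = {}
for b in bad_bytes:
    try:
        b.decode('utf-8', 'strict')
    except UnicodeDecodeError as err:
        # A lone byte in the range 0x80-0xBF cannot start a UTF-8 sequence.
        print(err)
    # windows-1252 maps every one of these bytes to a printable character.
    as_cp1252[b] = b.decode('windows-1252')
    # With 'ignore', the undecodable byte is simply dropped.
    as_utf8_ignore[b] = b.decode('utf-8', 'ignore')

print(as_cp1252)
print(as_utf8_ignore)
```

For these three bytes windows-1252 and ISO-8859-1 agree, giving ¦, ¬ and ¢ respectively.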
import chardet

with open("DATA/ATSDTA_ATSP600.csv", 'rb') as f:
    for line in f:
        code = chardet.detect(line)
        # if code == {'confidence': 0.5, 'encoding': 'windows-1252'}:
        if code != {'encoding': 'ascii', 'confidence': 1.0}:
            print(code)
            # Decode the same raw line four ways and split into columns.
            win = line.decode('windows-1252').split(',')  # windows-1252
            norm = line.decode('utf-8', 'ignore').split(',')
            ascii = line.decode('ascii', "ignore").split(',')
            ascii2 = line.decode('ISO-8859-1').split(',')
            # Print only the columns where the decodings disagree.
            for w, n, a, a2 in zip(win, norm, ascii, ascii2):
                if w != n:
                    print(w, n, a, a2)
            # First column of the offending line.
            print(win[0])
## Output
{'encoding': 'windows-1252', 'confidence': 0.5}
"¦ " " " " " "¦ "
"040543"
{'encoding': 'windows-1252', 'confidence': 0.5}
"LEASE GREGPRU D ¬ETERSPM " "LEASE GREGPRU D ETERSPM " "LEASE GREGPRU D ETERSPM " "LEASE GREGPRU D ¬ETERSPM "
"979643"
{'encoding': 'windows-1252', 'confidence': 0.5}
"¦ " " " " " "¦ "
"986979"
{'encoding': 'windows-1252', 'confidence': 0.5}
"WELLS FARGO &¢ COMPANY " "WELLS FARGO & COMPANY " "WELLS FARGO & COMPANY " "WELLS FARGO &¢ COMPANY "
"994946"
{'encoding': 'windows-1252', 'confidence': 0.5}
OSSOSSO¬¬O " OSSOSSOO " OSSOSSOO " OSSOSSO¬¬O "
"996535"
Vincent Davis
720-301-3003