[Tutor] UnicodeDecodeError while parsing a .csv file.
Steven D'Aprano
steve at pearwood.info
Tue Oct 29 00:49:46 CET 2013
On Mon, Oct 28, 2013 at 06:13:59PM -0400, SM wrote:
> Hello,
> I have an extremely simple piece of code which reads a .csv file, which has
> 1000 lines of fixed fields, one line at a time, and tries to print some
> values.
>
> 1 #!/usr/bin/python3
> 2 #
> 3 import sys, time, re, os
> 4
> 5 if __name__=="__main__":
> 6
> 7 ifd = open("infile.csv", 'r')
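By default Python 3 decodes text files using your platform's preferred
encoding (see locale.getpreferredencoding()), which on your system is
evidently UTF-8. As the error below shows, your file actually isn't UTF-8.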
What are you using to generate the CSV file? Consult the documentation
for that program and see what it is using. If it has an option to save
using UTF-8, use that.
See below for more discussion.
> 8
> 9 linenum = 0
> 10 for line in ifd:
> 11 line1 = re.split(",", line)
> 12 total = 0
> 13 if linenum == 0:
> 14 linenum = linenum + 1
> 15 continue
[snip many more lines of code]
All of this manual effort is unnecessary, as Python comes standard with
a library to read CSV files. It is much better to use that:
http://docs.python.org/3/library/csv.html
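As a rough sketch of what that looks like, the snippet below reads a few rows with csv.reader and skips the header row, much as the linenum == 0 check does in your code. The sample data and column names are made up for illustration, and io.StringIO stands in for your real file so the example runs on its own.

```python
import csv
import io

# Hypothetical sample data standing in for infile.csv.
sample = io.StringIO("name,qty,price\nwidget,2,3.50\ngadget,1,10.00\n")

reader = csv.reader(sample)
header = next(reader)            # consume the header row
rows = [row for row in reader]   # each remaining row is a list of strings

for row in rows:
    print(row)
```

With a real file you would pass the open file object to csv.reader instead, and the csv documentation recommends opening it with newline=''.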
> 31 ifd.close
This line is buggy. To close the file, you need to *call* the close
method by using parentheses, that is, you must write:
ifd.close()
Without the parentheses, you just get a reference to the close method
but don't do anything with it.
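Here is a small demonstration of the difference, using a throwaway temporary file (an assumption, so that the example is self-contained). It also shows the "with" statement, which is my suggestion rather than something in your code: it closes the file for you when the block ends.

```python
import os
import tempfile

# Hypothetical scratch file so the example does not depend on infile.csv.
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)

ifd = open(path, "r")
ifd.close                       # only looks up the method; the file stays open
still_open = not ifd.closed
ifd.close()                     # the call actually closes the file
now_closed = ifd.closed

# A with block closes the file automatically, even if an error occurs inside it.
with open(path, "r") as ifd2:
    pass
closed_by_with = ifd2.closed

os.remove(path)
print(still_open, now_closed, closed_by_with)
```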
> It works fine till it parses the 1st 126 lines in the input file. For the
> 127th line (irrespective of the contents of the actual line), it prints the
> following error:
> Traceback (most recent call last):
> File "p1.py", line 10, in <module>
> for line in ifd:
> File "/usr/lib/python3.2/codecs.py", line 300, in decode
> (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1173:
> invalid continuation byte
> $
>
> I am not able to figure out the cause of this error. Any clues as to why I
> am seeing this error, are appreciated.
As mentioned earlier, the error is that the CSV file is not encoded
using UTF-8. Best solution is to go back to the source where the file
comes from and pick the option to always save using UTF-8.
Second best solution is to identify what codec is actually being used.
If you tell us what program generates the CSV file in the first place,
and the operating system you are using (Windows? Mac? Linux?), we might
be able to identify the codec being used.
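In the meantime, it can help to look at the raw bytes around the failure. The sketch below uses a made-up byte string that triggers the same kind of error as your traceback; the data is an assumption, not your file. UnicodeDecodeError records where decoding stopped in its start attribute.

```python
# Hypothetical data: "é" stored as the single byte 0xE9, which is not valid UTF-8 here.
raw = b"caf\xe9,1\n"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    bad_offset = exc.start                 # index of the first undecodable byte
    bad_byte = raw[bad_offset]             # the byte itself, as an int
    context = raw[max(0, bad_offset - 10): bad_offset + 10]
    print(bad_offset, hex(bad_byte), context)
```

For your real file you would get the bytes with open("infile.csv", "rb").read() and look at the text surrounding the bad byte. An accented letter in a name or place is a strong hint towards one of the Western European single-byte encodings.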
If you can't identify the codec, you can guess. Guessing is bad, for two
reasons:
- you can waste a lot of time with bad guesses;
- worse, some bad guesses won't give you an error, but will just
give you bad data.
Nevertheless, you can try using a different encoding when you open the
file. Try this:
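ifd = open("infile.csv", 'r', encoding='latin-1')
"Latin 1" is an encoding which cannot fail, because every possible byte
maps to some character, but if the guess is wrong it will give back
rubbish data. (In Latin 1 the byte 0xe9 from your traceback is "é", so
the guess is at least plausible.) Such rubbish data is often called
"mojibake":
en.wikipedia.org/wiki/Mojibake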
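To make that concrete, here is a short illustration using a made-up word: text that really is UTF-8 gets decoded as Latin 1 without any error, but the accented letter comes out as two wrong characters.

```python
text = "café"
utf8_bytes = text.encode("utf-8")       # b'caf\xc3\xa9'
wrong = utf8_bytes.decode("latin-1")    # no error, but the result is mojibake
print(wrong)

# Latin 1 assigns a character to every byte value, so decoding never raises.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")
print(len(decoded))
```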
Another option is to cover up the errors by passing an error handler:
ifd = open("infile.csv", 'r', errors='replace')
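which will replace any undecodable bytes in the file with U+FFFD, the
Unicode "replacement character" (usually displayed as a question mark
in a diamond). The original bytes are lost, so use this only if you can
live without them.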
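A minimal sketch of the effect, using a temporary file with made-up contents. I pass encoding="utf-8" explicitly here, which is an assumption so that the example behaves the same on every platform.

```python
import os
import tempfile

# Hypothetical file containing one byte (0xE9) that is not valid UTF-8 in this position.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as f:
    f.write(b"caf\xe9,1\n")

with open(path, "r", encoding="utf-8", errors="replace") as ifd:
    line = ifd.readline()

os.remove(path)
print(repr(line))
```

The bad byte comes back as "\ufffd" and the rest of the line is read normally.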
--
Steven