[Tutor] UnicodeDecodeError while parsing a .csv file.

Tue Oct 29 00:49:46 CET 2013

On Mon, Oct 28, 2013 at 06:13:59PM -0400, SM wrote:
> Hello,
> I have an extremely simple piece of code which reads a .csv file, which has
> 1000 lines of fixed fields, one line at a time, and tries to print some
> values.
> 
>   1 #!/usr/bin/python3
>   2 #
>   3 import sys, time, re, os
>   4
>   5 if __name__=="__main__":
>   6
>   7     ifd = open("infile.csv", 'r')

By default Python 3 decodes text files using your locale's preferred 
encoding, which on your system is evidently UTF-8. As the error below 
shows, your file actually isn't UTF-8.

What are you using to generate the CSV file? Consult the documentation 
for that program and see which encoding it uses. If it has an option to 
save using UTF-8, use that.

See below for more discussion.

>   8
>   9     linenum = 0
>  10     for line in ifd:
>  11         line1 = re.split(",", line)
>  12         total = 0
>  13         if linenum == 0:
>  14             linenum = linenum + 1
>  15             continue
[snip many more lines of code]

All of this manual effort is unnecessary, as Python comes standard with 
a library to read CSV files. It is much better to use that:

http://docs.python.org/3/library/csv.html
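
For illustration, here is a minimal sketch of reading a file with the 
csv module instead of splitting each line by hand. The function name 
read_rows, the sample file and its contents are all made up for this 
example; I don't know what your real columns look like.

```python
import csv
import os
import tempfile


def read_rows(path, encoding="utf-8"):
    """Return the data rows of a CSV file as lists of strings, skipping the header."""
    # newline="" is what the csv module documentation recommends for file objects.
    with open(path, "r", encoding=encoding, newline="") as ifd:
        reader = csv.reader(ifd)
        next(reader, None)  # skip the first (header) row, like the original loop
        return list(reader)


# Small demonstration on a throwaway file with made-up contents.
with tempfile.TemporaryDirectory() as tmpdir:
    sample = os.path.join(tmpdir, "sample.csv")
    with open(sample, "w", encoding="utf-8", newline="") as out:
        out.write('name,value\r\n"Smith, J",1\r\nJones,2\r\n')
    rows = read_rows(sample)
```

Note that the quoted field "Smith, J" stays in one piece, which splitting 
on "," would get wrong. For your own file you would call something like 
read_rows("infile.csv"), with whatever encoding turns out to be correct.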

>  31     ifd.close

This line is buggy. To close the file, you need to *call* the close 
method by using parentheses, that is, you must write:

ifd.close()

Without the parentheses, you just get a reference to the close method 
but don't do anything with it.
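
A short sketch of the difference, using a throwaway temporary file so 
that it runs anywhere (the variable names are only for illustration). It 
also shows the "with" statement, which closes the file for you:

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on infile.csv.
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)

ifd = open(path, "r")
ifd.close                      # only looks up the method; nothing is called
still_open = not ifd.closed    # the file is still open at this point
ifd.close()                    # the parentheses actually call the method
now_closed = ifd.closed

# A with block closes the file automatically, even if an exception occurs.
with open(path, "r") as ifd2:
    pass
closed_by_with = ifd2.closed

os.remove(path)
```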

> It works fine till  it parses the 1st 126 lines in the input file. For the
> 127th line (irrespective of the contents of the actual line), it prints the
> following error:
> Traceback (most recent call last):
>   File "p1.py", line 10, in <module>
>     for line in ifd:
>   File "/usr/lib/python3.2/codecs.py", line 300, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1173:
> invalid continuation byte
> $
> 
> I am not able to figure out the cause of this error. Any clues as to why I
> am seeing this error, are appreciated.

As mentioned earlier, the error is that the CSV file is not encoded 
using UTF-8. The best solution is to go back to the source where the 
file comes from and pick the option to always save using UTF-8.

Second best solution is to identify what codec is actually being used. 
If you tell us what program generates the CSV file in the first place, 
and the operating system you are using (Windows? Mac? Linux?), we might 
be able to identify the codec being used.
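
In the meantime, it may help to find out exactly which lines and bytes 
are the problem. The following is only a sketch: find_bad_lines is a 
name I made up, and the sample data is invented. It reads the file in 
binary mode, so nothing is decoded until we ask for it.

```python
import os
import tempfile


def find_bad_lines(path, encoding="utf-8"):
    """Return (line number, offset in line, offending bytes) for lines that fail to decode."""
    problems = []
    with open(path, "rb") as ifd:  # binary mode: no decoding happens here
        for linenum, raw in enumerate(ifd, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as err:
                # err.start and err.end mark the bytes the decoder rejected.
                problems.append((linenum, err.start, raw[err.start:err.end]))
    return problems


# Demonstration on made-up data: the second line contains a lone 0xe9 byte.
with tempfile.TemporaryDirectory() as tmpdir:
    sample = os.path.join(tmpdir, "sample.csv")
    with open(sample, "wb") as out:
        out.write(b"name,value\ncaf\xe9,10\nplain,20\n")
    problems = find_bad_lines(sample)
```

The byte in your traceback, 0xe9, is the letter "é" in both Latin-1 and 
Windows-1252. That is consistent with one of those encodings, but it 
does not prove it.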

If you can't identify the codec, you can guess. Guessing is bad, for two 
reasons:

- you can waste a lot of time with bad guesses;

- worse, some bad guesses won't give you an error, but will just 
  give you bad data.

Nevertheless, you can try using a different encoding when you open the 
file. Try this:

ifd = open("infile.csv", 'r', encoding='latin-1')

"Latin-1" is an encoding which cannot fail when decoding, because every 
possible byte maps to some character, but it might give back rubbish 
data. Such rubbish data is often called "mojibake":

en.wikipedia.org/wiki/Mojibake
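
A small sketch of how a wrong guess gives bad data without any error. 
The sample strings are made up for illustration:

```python
# "cafe" with an acute accent on the e, encoded as UTF-8: b'caf\xc3\xa9'
raw = "caf\u00e9".encode("utf-8")

right = raw.decode("utf-8")     # the original four-character string
wrong = raw.decode("latin-1")   # no exception, but five characters of rubbish
# wrong == 'caf\xc3\xa9': the single accented letter became two characters

# Two plausible guesses can also disagree silently with each other:
as_latin1 = b"\x80".decode("latin-1")   # an invisible control character
as_cp1252 = b"\x80".decode("cp1252")    # the euro sign, '\u20ac'
```

Neither wrong guess raises an exception, which is exactly why guessing 
is risky.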

Another option is to cover up the errors by passing an error handler:

ifd = open("infile.csv", 'r', errors='replace')

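which will replace any undecodable bytes in the file with the Unicode 
replacement character U+FFFD, which acts as a "missing character" 
marker. The original bytes are lost, so treat this as a last resort.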
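
To see what the error handler does, here is a sketch using bytes.decode, 
which accepts the same errors argument as open. The sample bytes are 
invented:

```python
raw = b"caf\xe9,10\n"   # Latin-1 bytes; the 0xe9 is not valid UTF-8 here

# With the default errors="strict" this would raise UnicodeDecodeError.
text = raw.decode("utf-8", errors="replace")
# text == 'caf\ufffd,10\n'

# The rest of the line survives, but the original byte cannot be recovered.
fields = text.rstrip("\n").split(",")
```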

-- 
Steven