[Tutor] UnicodeDecodeError while parsing a .csv file.

Tue Oct 29 02:35:42 CET 2013

Hi Steven, thanks very much for the very detailed reply. It was very
useful.
This is just a utility script that reads some sentiment analysis data and
combines the positive and negative sentiments of multiple people into a
single sentiment per line. The data came from a public source that I have
no control over. What worked was your suggestion to ignore the
errors (I made sure that my results are not affected when I choose to
ignore them).
Thanks much.

On Mon, Oct 28, 2013 at 7:49 PM, Steven D'Aprano <steve at pearwood.info> wrote:

> On Mon, Oct 28, 2013 at 06:13:59PM -0400, SM wrote:
> > Hello,
> > I have an extremely simple piece of code which reads a .csv file, which has
> > 1000 lines of fixed fields, one line at a time, and tries to print some
> > values.
> >
> >   1 #!/usr/bin/python3
> >   2 #
> >   3 import sys, time, re, os
> >   4
> >   5 if __name__=="__main__":
> >   6
> >   7     ifd = open("infile.csv", 'r')
>
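> By default Python 3 decodes text files using the platform's preferred
> encoding (see locale.getpreferredencoding), which on your system is
> evidently UTF-8. As the error below shows, your file actually isn't UTF-8.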
>
> What are you using to generate the CSV file? Consult the documentation
> for that program and see what it is using. If it has an option to save
> using UTF-8, use that.
>
> See below for more discussion.
>
>
> >   8
> >   9     linenum = 0
> >  10     for line in ifd:
> >  11         line1 = re.split(",", line)
> >  12         total = 0
> >  13         if linenum == 0:
> >  14             linenum = linenum + 1
> >  15             continue
> [snip many more lines of code]
>
> All of this manual effort is unnecessary, as Python comes standard with
> a library to read CSV files. It is much better to use that:
>
> http://docs.python.org/3/library/csv.html
>
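A minimal sketch of what reading with the csv module could look like, assuming the first row is a header (as the linenum == 0 check in the script suggests). The column names, the sample rows, and the helper name read_rows are made up for illustration; they are not taken from the real infile.csv.

```python
import csv
import io


def read_rows(fileobj):
    """Return the data rows of a CSV file object, skipping the header row."""
    reader = csv.reader(fileobj)
    next(reader, None)  # skip the header row, if there is one
    return [row for row in reader]


# Hypothetical sample data standing in for infile.csv.
sample = io.StringIO('name,positive,negative\n"Smith, J",3,1\nLee,0,2\n')
rows = read_rows(sample)
print(rows)
```

Note that the quoted field containing a comma stays in one piece, which a plain re.split(",", line) would not manage. For a real file, the csv documentation recommends opening it with newline='' and, in this case, an explicit encoding argument.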
> >  31     ifd.close
>
> This line is buggy. To close the file, you need to *call* the close
> method by using parentheses, that is, you must write:
>
> ifd.close()
>
>
> Without the parentheses, you just get a reference to the close method
> but don't do anything with it.
>
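One way to avoid forgetting the call altogether is a with statement, which closes the file when the block ends. The sketch below writes a throwaway temporary file first so that it runs on its own; the file contents are invented for the example.

```python
import os
import tempfile

# Create a throwaway CSV file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with open(path, "w", encoding="utf-8") as out:
    out.write("a,b\n1,2\n")

# The file is closed automatically when the with block exits,
# even if an exception is raised inside it.
with open(path, "r", encoding="utf-8") as ifd:
    lines = ifd.readlines()

closed_after_block = ifd.closed
print(closed_after_block)

os.remove(path)  # clean up the temporary file
```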
>
> > It works fine till it parses the 1st 126 lines in the input file. For the
> > 127th line (irrespective of the contents of the actual line), it prints the
> > following error:
> > Traceback (most recent call last):
> >   File "p1.py", line 10, in <module>
> >     for line in ifd:
> >   File "/usr/lib/python3.2/codecs.py", line 300, in decode
> >     (result, consumed) = self._buffer_decode(data, self.errors, final)
> > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1173: invalid continuation byte
> > $
> >
> > I am not able to figure out the cause of this error. Any clues as to why I
> > am seeing this error are appreciated.
>
> As mentioned earlier, the error is that the CSV file is not encoded
> using UTF-8. Best solution is to go back to the source where the file
> comes from and pick the option to always save using UTF-8.
>
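To see what such a failure looks like in isolation, here is a small sketch that decodes a few made-up bytes (not taken from the real file) and inspects the exception. The byte 0xe9 is "é" in Latin-1, but in UTF-8 it starts a multi-byte sequence, so the decoder complains when the next byte is not a continuation byte.

```python
# Made-up bytes: "café au lait" as it would be stored in Latin-1.
data = b"caf\xe9 au lait,1,0\n"

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    bad_offset = exc.start          # index of the first undecodable byte
    bad_byte = data[exc.start]      # the offending byte value
    reason = exc.reason             # the decoder's explanation
    # A few bytes on either side help when guessing the real encoding.
    context = data[max(0, exc.start - 10):exc.end + 10]
    print(f"byte {bad_byte:#x} at offset {bad_offset}: {reason}")
    print("context:", context)
```

Reading the real file in binary mode (open("infile.csv", "rb")) and applying the same check would show the bytes surrounding the problem spot.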
> Second best solution is to identify what codec is actually being used.
> If you tell us what program generates the CSV file in the first place,
> and the operating system you are using (Windows? Mac? Linux?), we might
> be able to identify the codec being used.
>
> If you can't identify the codec, you can guess. Guessing is bad, for two
> reasons:
>
> - you can waste a lot of time with bad guesses;
>
> - worse, some bad guesses won't give you an error, but will just
>   give you bad data.
>
> Nevertheless, you can try using a different encoding when you open the
> file. Try this:
>
> ifd = open("infile.csv", 'r', encoding='latin-1')
>
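If it comes to guessing, a sketch along these lines tries a short list of candidate encodings strictly and reports the first one that decodes without error. The helper name first_working_encoding and the candidate list are assumptions for illustration only, and a clean decode does not prove that the guess is correct.

```python
def first_working_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate that decodes raw without error, else None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Made-up bytes standing in for a chunk of the real file.
raw = b"caf\xe9,1,0\n"
guess = first_working_encoding(raw)
print(guess, repr(raw.decode(guess)))
```

Putting latin-1 last matters: it accepts any byte sequence, so it would hide every candidate listed after it.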
> "Latin 1" is an encoding which cannot fail, since it maps every possible
> byte to a character, but it might give back rubbish data. Such rubbish
> data is often called "mojibake":
>
> en.wikipedia.org/wiki/Mojibake
>
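A short illustration of how mojibake arises, using an invented string: text saved as UTF-8 but read back as Latin-1 decodes without any error, yet comes out wrong.

```python
original = "café"
utf8_bytes = original.encode("utf-8")     # 'é' becomes the two bytes C3 A9

# Decoding with the wrong codec raises no error; it just yields wrong text.
garbled = utf8_bytes.decode("latin-1")
print(garbled)

# In this simple case the damage can be undone by reversing the wrong step.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```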
> Another option is to cover up the errors by passing an error handler:
>
> ifd = open("infile.csv", 'r', errors='replace')
>
> which will replace any undecodable bytes in the file with U+FFFD, the
> Unicode replacement character, usually displayed as a "missing
> character" glyph.
>
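For comparison, a small sketch of the two lenient error handlers applied to the same made-up bytes. errors='replace' marks each bad byte with U+FFFD, while errors='ignore' silently drops it, so either way some of the original text is lost.

```python
# Made-up Latin-1 bytes that are not valid UTF-8.
raw = b"caf\xe9 au lait"

replaced = raw.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD
ignored = raw.decode("utf-8", errors="ignore")    # bad byte is dropped

print(repr(replaced))
print(repr(ignored))
```

The same errors argument can be passed to open(), as in the line quoted above.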
>
> --
> Steven