[Tutor] UnicodeDecodeError while parsing a .csv file.
SM
sunithanc at gmail.com
Tue Oct 29 02:35:42 CET 2013
Hi Steven, Thank you very much for the detailed reply. It was very
useful.
This is just a utility script that reads some sentiment analysis data and
combines the positive and negative sentiments of multiple people into a
single sentiment per line. The data came from a public domain that I have
no control over. What worked was your suggestion to ignore the errors (I
made sure that my results are not messed up when I choose to ignore the
errors).
Thanks very much.
On Mon, Oct 28, 2013 at 7:49 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> On Mon, Oct 28, 2013 at 06:13:59PM -0400, SM wrote:
> > Hello,
> > I have an extremely simple piece of code which reads a .csv file, which has
> > 1000 lines of fixed fields, one line at a time, and tries to print some
> > values.
> >
> >  1 #!/usr/bin/python3
> >  2 #
> >  3 import sys, time, re, os
> >  4
> >  5 if __name__=="__main__":
> >  6
> >  7     ifd = open("infile.csv", 'r')
>
> By default, Python 3 decodes text files using the platform's preferred
> encoding, which on your system is evidently UTF-8. As the error below
> shows, your file actually isn't UTF-8.
>
> What are you using to generate the CSV file? Consult the documentation
> for that program and see what it is using. If it has an option to save
> using UTF-8, use that.
>
> See below for more discussion.
>
>
> >  8
> >  9     linenum = 0
> > 10     for line in ifd:
> > 11         line1 = re.split(",", line)
> > 12         total = 0
> > 13         if linenum == 0:
> > 14             linenum = linenum + 1
> > 15             continue
> [snip many more lines of code]
>
> All of this manual effort is unnecessary, as Python comes standard with
> a library to read CSV files. It is much better to use that:
>
> http://docs.python.org/3/library/csv.html
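A minimal sketch of the csv-module approach suggested above. The sample data, the column names, and the helper name read_rows are hypothetical stand-ins, since the real contents of infile.csv are not shown in this thread.

```python
import csv
import io


def read_rows(stream):
    """Yield the data rows of a CSV stream, skipping the header row."""
    reader = csv.reader(stream)
    next(reader, None)  # skip the header line, if there is one
    for row in reader:
        yield row


# Hypothetical sample data standing in for infile.csv.
sample = io.StringIO('name,pos,neg\n"Smith, J",3,1\nLee,0,2\n')
rows = list(read_rows(sample))
print(rows)

# With a real file, the csv docs recommend opening it with newline="":
#     with open("infile.csv", newline="") as ifd:
#         for row in read_rows(ifd):
#             ...
```

Unlike splitting on "," by hand, csv.reader keeps a quoted field such as "Smith, J" together as one value.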
>
> > 31 ifd.close
>
> This line is buggy. To close the file, you need to *call* the close
> method by using parentheses, that is, you must write:
>
> ifd.close()
>
>
> Without the parentheses, you just get a reference to the close method
> but don't do anything with it.
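The difference can be checked with the file object's closed attribute. This is only a sketch; it uses a temporary scratch file (an assumption, so the example runs anywhere) rather than infile.csv.

```python
import os
import tempfile

# Hypothetical scratch file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)

ifd = open(path, "r")
ifd.close                      # only looks up the method; nothing is called
still_open = not ifd.closed    # the file is still open at this point
ifd.close()                    # calling the method actually closes the file
closed_after_call = ifd.closed

# A with-block closes the file automatically, even if an error occurs inside.
with open(path, "r") as ifd2:
    pass
closed_by_with = ifd2.closed

os.remove(path)
print(still_open, closed_after_call, closed_by_with)
```

Using the with-block form avoids having to remember the close() call at all.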
>
>
> > It works fine till it parses the 1st 126 lines in the input file. For the
> > 127th line (irrespective of the contents of the actual line), it prints the
> > following error:
> > Traceback (most recent call last):
> >   File "p1.py", line 10, in <module>
> >     for line in ifd:
> >   File "/usr/lib/python3.2/codecs.py", line 300, in decode
> >     (result, consumed) = self._buffer_decode(data, self.errors, final)
> > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1173: invalid continuation byte
> > $
> >
> > I am not able to figure out the cause of this error. Any clues as to why I
> > am seeing this error are appreciated.
>
> As mentioned earlier, the error is that the CSV file is not encoded
> using UTF-8. The best solution is to go back to the source where the file
> comes from and pick the option to always save using UTF-8.
>
> The second-best solution is to identify what codec is actually being used.
> If you tell us what program generates the CSV file in the first place,
> and the operating system you are using (Windows? Mac? Linux?), we might
> be able to identify the codec being used.
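When trying to identify the codec, it can help to look at the raw bytes around the position the error reports. The following is a rough sketch, not a detection tool; the helper name find_bad_bytes and the sample bytes are made up for illustration.

```python
def find_bad_bytes(data, encoding="utf-8", context=10):
    """Return (offset, nearby bytes) for the first undecodable byte, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        lo = max(err.start - context, 0)
        return err.start, data[lo:err.start + context]
    return None


# Hypothetical sample: an accented "e" stored as the single byte 0xE9,
# which is how Latin-1 and Windows-1252 encode it.
sample = b"id,comment\n1,caf\xe9 au lait\n"
result = find_bad_bytes(sample)
print(result)

# For a real file, read it in binary mode first:
#     with open("infile.csv", "rb") as f:
#         print(find_bad_bytes(f.read()))
```

Seeing the offending byte in context (here, between "caf" and " au lait") often makes the intended character, and hence the likely encoding, easier to guess.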
>
> If you can't identify the codec, you can guess. Guessing is bad, for two
> reasons:
>
> - you can waste a lot of time with bad guesses;
>
> - worse, some bad guesses won't give you an error, but will just
> give you bad data.
>
> Nevertheless, you can try using a different encoding when you open the
> file. Try this:
>
> ifd = open("infile.csv", 'r', encoding='latin-1')
>
> "Latin 1" is an encoding which should not fail, but it might give back
> rubbish data. Such rubbish data is often called "mojibake":
>
> en.wikipedia.org/wiki/Mojibake
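A small illustration of why Latin-1 never raises an error yet can still return the wrong text. The sample string is hypothetical.

```python
# Hypothetical sample text, encoded two different ways.
text = "caf\u00e9"                      # "café"
utf8_bytes = text.encode("utf-8")       # the accented letter becomes two bytes
latin1_bytes = text.encode("latin-1")   # the accented letter becomes one byte

# Latin-1 maps every byte value 0x00-0xFF to a character, so decoding
# with it cannot fail, whatever the bytes are.
all_bytes_decoded = bytes(range(256)).decode("latin-1")

# Decoding with the right codec gives back the original text...
right = latin1_bytes.decode("latin-1")
# ...but decoding UTF-8 bytes as Latin-1 silently produces mojibake.
wrong = utf8_bytes.decode("latin-1")

print(repr(right))
print(repr(wrong))
```

No exception is raised in either case, which is exactly why a wrong guess can go unnoticed.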
>
> Another option is to cover up the errors by passing an error handler:
>
> ifd = open("infile.csv", 'r', errors='replace')
>
> which will replace any undecodable bytes in the file with a "missing
> character".
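A short sketch comparing the error handlers on a single line of bytes. The sample bytes are hypothetical; open() accepts the same errors argument as bytes.decode() does here.

```python
raw = b"caf\xe9 au lait"   # hypothetical line; 0xE9 is not valid UTF-8 here

# The default handler ("strict") raises an exception on the bad byte.
strict_failed = False
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    strict_failed = True

# "replace" substitutes U+FFFD, the Unicode replacement character.
replaced = raw.decode("utf-8", errors="replace")
# "ignore" silently drops the undecodable byte.
ignored = raw.decode("utf-8", errors="ignore")

print(strict_failed, repr(replaced), repr(ignored))
```

With errors='replace' the damage stays visible in the output, whereas errors='ignore' removes the character without a trace, so it is worth checking the results either way.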
>
>
> --
> Steven