[Tutor] UnicodeDecodeError while parsing a .csv file.

Tue Oct 29 02:33:32 CET 2013

Hi Bob,

Thanks very much for your quick and detailed reply. This is just a utility
script that reads some sentiment analysis data and combines the positive
and negative sentiments of multiple people into a single sentiment per
line. The data came from a public-domain source that I have no control
over. What worked was Steve's suggestion to ignore the errors (I checked
that my results are not affected when I ignore them).

Thanks for the other suggestions. I haven't done much file I/O in Python,
hence the crude method I used.
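
For the archive, here is a minimal sketch of the "ignore the errors"
approach. The function name read_rows is made up for illustration, and
the latin-1 alternative is only a guess at the file's real encoding, which
I have not confirmed.

```python
def read_rows(path, encoding="utf-8", errors="ignore"):
    """Return the data rows of a CSV file as lists of fields.

    The first line is treated as a header and skipped.
    errors="ignore" silently drops bytes that are not valid in `encoding`;
    errors="replace" would substitute U+FFFD so the damage stays visible.
    """
    rows = []
    with open(path, "r", encoding=encoding, errors=errors, newline="") as ifd:
        ifd.readline()  # discard the header line
        for line in ifd:
            rows.append(line.rstrip("\r\n").split(","))
    return rows
```

Calling read_rows("infile.csv") drops the undecodable bytes, while
read_rows("infile.csv", encoding="latin-1") keeps every byte, since
latin-1 maps all 256 byte values to characters. Which one is right depends
on how the file was actually written.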

On Mon, Oct 28, 2013 at 7:31 PM, bob gailer <bgailer at gmail.com> wrote:

> On 10/28/2013 6:13 PM, SM wrote:
> > Hello,
> Hi, welcome to the Tutor list.
>
>
> > I have an extremely simple piece of code
>
> which could be even simpler - see my comments below
>
>
> > which reads a .csv file, which has 1000 lines of fixed fields, one line
> > at a time, and tries to print some values.
> >
> >   1 #!/usr/bin/python3
> >   2 #
> >   3 import sys, time, re, os
> >   4
> >   5 if __name__=="__main__":
> >   6
> >   7     ifd = open("infile.csv", 'r')
>
> The simplest way to discard the first line is to follow the open with
> 8     ifd.readline()
>
> The simplest way to track line number is
>
> 10     for linenum, line in enumerate(ifd, 1):
>
> >  11         line1 = line.split(",")
>
> FWIW you don't need re to do this split
>
> >  12         total = 0
>
> >  19         print("LINE: ", linenum, line1[1])
> >  20         for i in range(1,8):
> >  21             if line1[i].strip():
> >  22                 print("line[i] ", int(line1[i]))
> >  23                 total = total + int(line1[i])
> >  24         print("Total: ", total)
> >  25
> >  26         if total >= 4:
> >  27             print("POSITIVE")
> >  28         else:
> >  29             print("Negative")
> >  31     ifd.close()
>
> That should have () after it, since it is a method call.
>
> >
> > It works fine till it parses the 1st 126 lines in the input file. For
> > the 127th line (irrespective of the contents of the actual line), it
> > prints the following error:
> > Traceback (most recent call last):
> >   File "p1.py", line 10, in <module>
> >     for line in ifd:
> >   File "/usr/lib/python3.2/codecs.py", line 300, in decode
> >     (result, consumed) = self._buffer_decode(data, self.errors, final)
> > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position
> > 1173: invalid continuation byte
>
> Do you get exactly the same message irrespective of the contents of the
> actual line?
>
> "Code points larger than 127 are represented by multi-byte sequences,
> composed of a leading byte and one or more continuation bytes. The leading
> byte has two or more high-order 1s followed by a 0, while continuation
> bytes all have '10' in the high-order position."
>
> This suggests that a byte close to the end of the previous line is a
> "leading byte", and therefore a continuation byte was expected where the
> 0xe9 was found.
>
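
The byte patterns in that quotation can be checked with a short sketch.
The helper names utf8_role and describe_failure are made up for
illustration. Going by the bit pattern, 0xe9 (11101001) is itself a lead
byte for a 3-byte sequence, so the decoder seems to be objecting to the
byte that follows it. 0xe9 is also "é" in Latin-1, which hints (an
assumption, not something confirmed for this file) that the data is
Latin-1 rather than UTF-8.

```python
def utf8_role(byte):
    """Classify one byte value (0-255) by its high-order bits only.

    This is a bit-pattern check, not a validator: 0xC0, 0xC1 and
    0xF5-0xF7 match a lead pattern but never occur in valid UTF-8.
    """
    if byte < 0x80:
        return "ascii"          # 0xxxxxxx
    if byte < 0xC0:
        return "continuation"   # 10xxxxxx
    if byte < 0xE0:
        return "lead of 2"      # 110xxxxx
    if byte < 0xF0:
        return "lead of 3"      # 1110xxxx
    if byte < 0xF8:
        return "lead of 4"      # 11110xxx
    return "invalid"


def describe_failure(raw, encoding="utf-8"):
    """Return (offset, reason) of the first decode error, or None if clean."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start, exc.reason
    return None


# "café" written as Latin-1, followed by ordinary ASCII CSV text.
sample = "café".encode("latin-1") + b",3"

print(format(0xE9, "08b"), utf8_role(0xE9))
print(describe_failure(sample))
```

Here the 0xe9 is followed by a comma, which is plain ASCII and not a
continuation byte, and that reproduces the same "invalid continuation
byte" reason as in the traceback.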
> BTW, when I divide 1173 by 126 I get something close to 9 characters per
> line. That is not possible, as there would have to be at least 16
> characters in each line.
>
> Best you send us at least the first 130 lines so we can play with the file.
>
> --
> Bob Gailer
> 919-636-4239
> Chapel Hill NC
>
>
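
For anyone who hits the same error, here is a rough sketch for finding
which lines actually contain the bad bytes. As far as I understand, the
position in the traceback is an offset into the block of bytes handed to
the decoder, not a line number, so it is easier to read the file in binary
mode and try to decode one line at a time. The function name
find_bad_lines is made up for illustration.

```python
def find_bad_lines(path, encoding="utf-8"):
    """Return (line_number, offset_in_line, raw_bytes) for undecodable lines.

    The file is opened in binary mode, so iterating over it yields bytes
    split on b"\n" and never raises a decode error by itself.
    """
    bad = []
    with open(path, "rb") as ifd:
        for linenum, raw in enumerate(ifd, 1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                bad.append((linenum, exc.start, raw))
    return bad


if __name__ == "__main__":
    import sys

    # Usage: python3 find_bad.py infile.csv
    if len(sys.argv) > 1:
        for linenum, offset, raw in find_bad_lines(sys.argv[1]):
            print("line", linenum, "offset", offset, repr(raw))
```

Printing the raw bytes with repr() shows the offending values (such as
\xe9) without needing to know the real encoding first.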