[Tutor] UnicodeDecodeError while parsing a .csv file.
SM
sunithanc at gmail.com
Tue Oct 29 02:33:32 CET 2013
Hi Bob, Thanks very much for your quick and detailed reply. This is just
a utility script that reads some sentiment analysis data and combines the
positive and negative sentiments of multiple people into a single sentiment
per line. The data came from a public source that I have no
control over. What worked was Steve's suggestion to ignore the errors (I
made sure that my results are not messed up when I choose to ignore the
errors).
Thanks for the other suggestions. I haven't done much file I/O in
Python. Hence the crude method that I used.
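
Ignoring decode errors while reading can be sketched roughly as follows. This is only an illustration: it assumes the file is meant to be read as UTF-8 and that silently dropping the undecodable bytes is acceptable, and the helper name read_lines_ignoring_errors is made up for this example.

```python
def read_lines_ignoring_errors(path, encoding="utf-8"):
    """Return the lines of a text file, dropping bytes that cannot be decoded.

    Hypothetical helper for illustration; not taken from the original script.
    """
    # errors="ignore" makes the decoder skip undecodable bytes
    # instead of raising UnicodeDecodeError.
    with open(path, "r", encoding=encoding, errors="ignore") as ifd:
        return [line.rstrip("\n") for line in ifd]
```

Passing errors="replace" instead would substitute U+FFFD for each bad byte, which keeps the losses visible in the output.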
On Mon, Oct 28, 2013 at 7:31 PM, bob gailer <bgailer at gmail.com> wrote:
> On 10/28/2013 6:13 PM, SM wrote:
> > Hello,
> Hi welcome to the Tutor list
>
>
> > I have an extremely simple piece of code
>
> which could be even simpler - see my comments below
>
>
> > which reads a .csv file, which has 1000 lines of fixed fields, one line
> > at a time, and tries to print some values.
> >
> > 1 #!/usr/bin/python3
> > 2 #
> > 3 import sys, time, re, os
> > 4
> > 5 if __name__=="__main__":
> > 6
> > 7 ifd = open("infile.csv", 'r')
>
> The simplest way to discard the first line is to follow the open with
> 8 ifd.readline()
>
> The simplest way to track line number is
>
> 10 for linenum, line in enumerate(ifd, 1):
>
> > 11 line1 = line.split(",")
>
> FWIW you don't need re to do this split
>
> > 12 total = 0
>
> > 19 print("LINE: ", linenum, line1[1])
> > 20 for i in range(1,8):
> > 21 if line1[i].strip():
> > 22 print("line[i] ", int(line1[i]))
> > 23 total = total + int(line1[i])
> > 24 print("Total: ", total)
> > 25
> > 26 if total >= 4:
> > 27 print("POSITIVE")
> > 28 else:
> > 29 print("Negative")
> > 31 ifd.close()
>
> That should have () after it, since it is a method call.
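
Taken together, the suggestions above (skip the header, use enumerate, split without re) might look like the sketch below. It assumes the same layout as the quoted script, with integer values in columns 1 to 7 and a threshold of 4; the function name classify_lines is invented for this example. Unlike the original loop, slicing does not raise IndexError on a short row.

```python
def classify_lines(lines):
    """Yield (linenum, total, label) for each data line, skipping the header.

    Hypothetical helper; `lines` is any iterable of CSV text lines.
    """
    it = iter(lines)
    next(it, None)  # discard the header line
    for linenum, line in enumerate(it, 1):
        fields = line.split(",")
        # Sum the non-empty values in columns 1..7, as the original loop does.
        total = sum(int(f) for f in fields[1:8] if f.strip())
        label = "POSITIVE" if total >= 4 else "Negative"
        yield linenum, total, label
```

Opening the file in a with statement and passing the file object to classify_lines also closes the file automatically, so no explicit close() call is needed.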
>
> >
> > It works fine till it parses the 1st 126 lines in the input file. For
> > the 127th line (irrespective of the contents of the actual line), it prints
> > the following error:
> > Traceback (most recent call last):
> > File "p1.py", line 10, in <module>
> > for line in ifd:
> > File "/usr/lib/python3.2/codecs.py", line 300, in decode
> > (result, consumed) = self._buffer_decode(data, self.errors, final)
> > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1173: invalid continuation byte
> Do you get exactly the same message irrespective of the contents of the
> actual line?
>
> "Code points larger than 127 are represented by multi-byte sequences,
> composed of a leading byte and one or more continuation bytes. The leading
> byte has two or more high-order 1s followed by a 0, while continuation
> bytes all have '10' in the high-order position."
>
> This suggests that a byte close to the end of the previous line is a
> "leading byte" and therefore a continuation byte was expected where the
> 0xe9 was found.
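
The byte patterns in the quoted description can be checked with a few lines of Python. The sketch below is illustrative only: utf8_byte_kind is a made-up helper, it looks only at the high-order bits and does not reject byte values that never occur in valid UTF-8, and the sample bytes are invented (0xe9 is 'é' in Latin-1, which is one plausible source of such a byte, not a confirmed fact about this file). Note that 0xe9 is 11101001 in binary, so by bit pattern it is itself a leading byte.

```python
def utf8_byte_kind(b):
    """Classify one byte value (0-255) by its high-order bits.

    Hypothetical helper; does not validate that the byte is legal in UTF-8.
    """
    if b < 0x80:
        return "ascii"          # 0xxxxxxx
    if b < 0xC0:
        return "continuation"   # 10xxxxxx
    return "leading"            # 11xxxxxx: starts a multi-byte sequence

# Invented sample: a Latin-1 encoded 'é' (0xe9) followed by a comma.
data = b"caf\xe9,1"
kinds = [utf8_byte_kind(b) for b in data]

try:
    data.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    # The byte after 0xe9 is a plain ASCII comma, not a continuation byte.
    decode_error = exc

as_latin1 = data.decode("latin-1")  # decodes without error
```

If the file turns out to be Latin-1 (or a similar single-byte encoding), opening it with that encoding would avoid the error without discarding any characters.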
>
> BTW, when I divide 1173 by 126 I get something close to 9 characters per line.
> That is not possible, as there would have to be at least 16 characters in
> each line.
>
> Best you send us at least the first 130 lines so we can play with the file.
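
One way to narrow down which lines are affected before sharing the file is to read it in binary mode and try decoding each line separately. The following is a rough sketch under the assumption that the file is expected to be UTF-8; find_undecodable_lines is a hypothetical name, and line numbers here count the header as line 1.

```python
def find_undecodable_lines(raw, encoding="utf-8"):
    """Return (line_number, raw_line) pairs for lines that fail to decode.

    Hypothetical helper; `raw` is the whole file content as bytes.
    """
    bad = []
    for number, raw_line in enumerate(raw.splitlines(), 1):
        try:
            raw_line.decode(encoding)
        except UnicodeDecodeError:
            bad.append((number, raw_line))
    return bad
```

The bytes to pass in can be obtained by opening the file with mode "rb" and calling read(); printing the returned raw lines shows the offending byte values as \x escapes.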
>
> --
> Bob Gailer
> 919-636-4239
> Chapel Hill NC
>
>