[Tutor] UnicodeDecodeError while parsing a .csv file.
bob gailer
bgailer at gmail.com
Tue Oct 29 00:31:54 CET 2013
On 10/28/2013 6:13 PM, SM wrote:
> Hello,
Hi, welcome to the Tutor list.
> I have an extremely simple piece of code
which could be even simpler - see my comments below
> which reads a .csv file, which has 1000 lines of fixed fields, one
> line at a time, and tries to print some values.
>
> 1 #!/usr/bin/python3
> 2 #
> 3 import sys, time, re, os
> 4
> 5 if __name__=="__main__":
> 6
> 7     ifd = open("infile.csv", 'r')
The simplest way to discard the first line is to follow the open with
8     ifd.readline()
The simplest way to track line number is
10     for linenum, line in enumerate(ifd, 1):
> 11         line1 = line.split(",")
FWIW you don't need re to do this split
> 12         total = 0
> 19         print("LINE: ", linenum, line1[1])
> 20         for i in range(1,8):
> 21             if line1[i].strip():
> 22                 print("line[i] ", int(line1[i]))
> 23                 total = total + int(line1[i])
> 24         print("Total: ", total)
> 25
> 26         if total >= 4:
> 27             print("POSITIVE")
> 28         else:
> 29             print("Negative")
> 31     ifd.close()
That should have () after it, since it is a method call.
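Putting the suggestions above together, the script might look roughly like the sketch below. This is not the original poster's code: the function name report_rows, the returned list of totals, and the in-memory sample data are all made up for illustration, and lines 13-18 of the original (which were not shown) are left out.

```python
import io


def report_rows(ifd):
    """Print a short report per data line; return the totals (hypothetical helper)."""
    ifd.readline()  # discard the first line
    totals = []
    for linenum, line in enumerate(ifd, 1):
        line1 = line.split(",")  # plain str.split, no re needed
        total = 0
        print("LINE: ", linenum, line1[1])
        for i in range(1, 8):
            if line1[i].strip():  # skip empty fields
                total += int(line1[i])
        print("Total: ", total)
        print("POSITIVE" if total >= 4 else "Negative")
        totals.append(total)
    return totals


# Made-up sample standing in for infile.csv: a first line plus two data lines.
sample = io.StringIO("id,a,b,c,d,e,f,g\nx,1,1,1,1,0,0,\ny,0,,1,0,0,0,0\n")
totals = report_rows(sample)
```

With a real file, the call would go inside `with open("infile.csv", "r") as ifd:`, which also closes the file without an explicit close() call.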
>
> It works fine till it parses the 1st 126 lines in the input file.
> For the 127th line (irrespective of the contents of the actual line), it
> prints the following error:
> Traceback (most recent call last):
>   File "p1.py", line 10, in <module>
>     for line in ifd:
>   File "/usr/lib/python3.2/codecs.py", line 300, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1173: invalid continuation byte
Do you get exactly the same message irrespective of the contents of the
actual line?
"Code points larger than 127 are represented by multi-byte sequences,
composed of a leading byte and one or more continuation bytes. The
leading byte has two or more high-order 1s followed by a 0, while
continuation bytes all have '10' in the high-order position."
This suggests that a byte close to the end of the previous line is a
"leading byte", and therefore a continuation byte was expected where
the 0xe9 was found.
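One way to check this reasoning is to reproduce the error on a handful of bytes. The sketch below assumes the file was saved in Latin-1, where 0xe9 is "é"; the thread does not say what the real encoding is, and the sample bytes are made up. Note that 0xe9 is 11101001 in binary, so by the rule quoted above the decoder treats it as a leading byte and complains when the byte after it is not a continuation byte.

```python
raw = b"caf\xe9,1,0\n"  # made-up sample; 0xe9 is "é" in Latin-1 (assumed encoding)

failure = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    failure = err
    # err.start is the offset of the offending byte, err.reason the message text
    print(err.start, hex(raw[err.start]), err.reason)

# Decoding with the encoding the bytes were actually written in succeeds.
text = raw.decode("latin-1")
print(text)
```

If the file does turn out to be Latin-1, passing encoding="latin-1" to open() should let the loop read past line 126.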
BTW when I divide 1173 by 126 I get something close to 9 characters per
line. That is not possible, as there would have to be at least 16
characters in each line.
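The arithmetic, spelled out as a quick sketch. The numbers come from the traceback and from this message; the variable names are made up for illustration.

```python
error_position = 1173  # byte offset reported in the traceback
lines_parsed = 126     # lines read successfully before the error

avg_chars_per_line = error_position / lines_parsed
print(round(avg_chars_per_line, 2))  # 9.31

min_chars_per_line = 16  # lower bound stated in this message
print(avg_chars_per_line >= min_chars_per_line)  # False
```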
Best you send us at least the first 130 lines so we can play with the file.
--
Bob Gailer
919-636-4239
Chapel Hill NC