<div dir="ltr"><div>Hi Steven, thank you very much for the detailed reply; it was very useful.<br>This is just a utility script that reads some sentiment analysis data and
combines the positive and negative sentiments of multiple people into a
single sentiment per line. The data came from a public-domain source
that I have no control over. What worked was your suggestion
to ignore the errors (I made sure that my results are not messed up when
I choose to ignore the errors).<br></div>Thanks, much.<br></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Oct 28, 2013 at 7:49 PM, Steven D'Aprano <span dir="ltr"><<a href="mailto:steve@pearwood.info" target="_blank">steve@pearwood.info</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">On Mon, Oct 28, 2013 at 06:13:59PM -0400, SM wrote:<br>
> Hello,<br>
> I have an extremely simple piece of code which reads a .csv file, which has<br>
> 1000 lines of fixed fields, one line at a time, and tries to print some<br>
> values.<br>
><br>
> 1 #!/usr/bin/python3<br>
> 2 #<br>
> 3 import sys, time, re, os<br>
> 4<br>
> 5 if __name__=="__main__":<br>
> 6<br>
> 7 ifd = open("infile.csv", 'r')<br>
<br>
</div>By default, Python 3 decodes text files using your locale's preferred<br>
encoding, which on your system is UTF-8. As the error below shows, your<br>
file isn't actually UTF-8.<br>
<br>
What are you using to generate the CSV file? Consult the documentation<br>
for that program and see what it is using. If it has an option to save<br>
using UTF-8, use that.<br>
<br>
See below for more discussion.<br>
<div class="im"><br>
<br>
> 8<br>
> 9 linenum = 0<br>
> 10 for line in ifd:<br>
> 11 line1 = re.split(",", line)<br>
> 12 total = 0<br>
> 13 if linenum == 0:<br>
> 14 linenum = linenum + 1<br>
> 15 continue<br>
</div>[snip many more lines of code]<br>
<br>
All of this manual effort is unnecessary, as Python comes standard with<br>
a library to read CSV files. It is much better to use that:<br>
<br>
<a href="http://docs.python.org/3/library/csv.html" target="_blank">http://docs.python.org/3/library/csv.html</a><br>
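As a rough sketch of what that could look like (the column names and sample rows here are made up, since I haven't seen the real file), csv.reader does the splitting and copes with quoted fields that contain commas, which re.split(",", line) does not:

```python
import csv
import io


def read_rows(fileobj):
    """Return the data rows of a CSV file, skipping the header line."""
    reader = csv.reader(fileobj)
    next(reader, None)  # skip the header row, if there is one
    return [row for row in reader]


# Hypothetical sample data standing in for infile.csv.
sample = io.StringIO('name,pos,neg\n"Smith, J",3,1\nJones,0,2\n')
rows = read_rows(sample)
print(rows)
```

With a real file you would pass in the result of open("infile.csv", newline=""), as the csv documentation recommends. Every field comes back as a string, so convert with int() or float() before adding anything up.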
<br>
> 31 ifd.close<br>
<br>
This line is buggy. To close the file, you need to *call* the close<br>
method by using parentheses, that is, you must write:<br>
<br>
ifd.close()<br>
<br>
<br>
Without the parentheses, you just get a reference to the close method<br>
but don't do anything with it.<br>
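A small demonstration of the difference, using a throwaway temporary file rather than your real one (the file contents are invented for the example). It also shows the "with" statement, which closes the file for you so there is no close call to forget:

```python
import os
import tempfile

# Create a small temporary file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w", encoding="utf-8") as out:
    out.write("a,b\n1,2\n")

ifd = open(path, "r", encoding="utf-8")
ifd.close                         # only looks the method up; nothing is called
closed_without_call = ifd.closed  # the file is still open here
ifd.close()                       # actually closes the file
closed_after_call = ifd.closed

# The with statement closes the file when the block ends, even on error.
with open(path, "r", encoding="utf-8") as ifd2:
    lines = ifd2.readlines()
closed_after_with = ifd2.closed

os.remove(path)
print(closed_without_call, closed_after_call, closed_after_with)
```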
<div class="im"><br>
<br>
> It works fine till it parses the 1st 126 lines in the input file. For the<br>
> 127th line (irrespective of the contents of the actual line), it prints the<br>
> following error:<br>
> Traceback (most recent call last):<br>
> File "p1.py", line 10, in <module><br>
> for line in ifd:<br>
> File "/usr/lib/python3.2/codecs.py", line 300, in decode<br>
> (result, consumed) = self._buffer_decode(data, self.errors, final)<br>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1173:<br>
> invalid continuation byte<br>
> $<br>
><br>
> I am not able to figure out the cause of this error. Any clues as to why I<br>
> am seeing this error, are appreciated.<br>
<br>
</div>As mentioned earlier, the error is that the CSV file is not encoded<br>
using UTF-8. The best solution is to go back to the source where the file<br>
comes from and pick the option to always save using UTF-8.<br>
<br>
The second-best solution is to identify what codec is actually being used.<br>
If you tell us what program generates the CSV file in the first place,<br>
and the operating system you are using (Windows? Mac? Linux?), we might<br>
be able to identify the codec being used.<br>
<br>
If you can't identify the codec, you can guess. Guessing is bad, for two<br>
reasons:<br>
<br>
- you can waste a lot of time with bad guesses;<br>
<br>
- worse, some bad guesses won't give you an error, but will just<br>
give you bad data.<br>
<br>
Nevertheless, you can try using a different encoding when you open the<br>
file. Try this:<br>
<br>
ifd = open("infile.csv", 'r', encoding='latin-1')<br>
<br>
"Latin 1" is an encoding which cannot fail, because every possible byte<br>
maps to some character, but it might give back rubbish data. Such rubbish<br>
data is often called "mojibake":<br>
<br>
<a href="http://en.wikipedia.org/wiki/Mojibake" target="_blank">en.wikipedia.org/wiki/Mojibake</a><br>
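To make that concrete, here is a sketch using a made-up line of bytes containing the same 0xe9 byte as in your traceback. I am only assuming the byte was meant as a Latin-1 "e with acute accent"; your file may well use some other encoding:

```python
# Hypothetical sample line; 0xe9 is "e with acute accent" in Latin-1.
raw = b"caf\xe9,1,0\n"

# Decoding as UTF-8 fails, because 0xe9 starts a multi-byte sequence
# and the comma that follows is not a valid continuation byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 never raises, since all 256 byte values map to a character.
as_latin1 = raw.decode("latin-1")

# A wrong guess that "works": UTF-8 bytes read as Latin-1 give mojibake.
garbled = "caf\u00e9".encode("utf-8").decode("latin-1")

print(utf8_ok, repr(as_latin1), repr(garbled))
```

The last case is the dangerous one: there is no exception, just a two-character mess where one accented letter should be.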
<br>
Another option is to cover up the errors by passing an error handler:<br>
<br>
ifd = open("infile.csv", 'r', errors='replace')<br>
<br>
which will replace any undecodable bytes in the file with the Unicode<br>
replacement character, U+FFFD, which usually displays as a question mark<br>
in a diamond.<br>
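A quick sketch of what the error handlers do, again on an invented line of bytes rather than your data. bytes.decode accepts the same errors argument as open, so it is a convenient way to experiment:

```python
# Hypothetical sample line containing one byte that is not valid UTF-8.
raw = b"caf\xe9,1,0\n"

# 'replace' substitutes U+FFFD for each undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

# 'ignore' silently drops the undecodable bytes instead.
ignored = raw.decode("utf-8", errors="ignore")

print(repr(replaced), repr(ignored))
```

Either way the original character is lost, so it is worth checking that the fields you actually care about (the numbers, presumably) come through unchanged.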
<span class="HOEnZb"><font color="#888888"><br>
<br>
--<br>
Steven<br>
_______________________________________________<br>
Tutor maillist - <a href="mailto:Tutor@python.org">Tutor@python.org</a><br>
To unsubscribe or change subscription options:<br>
<a href="https://mail.python.org/mailman/listinfo/tutor" target="_blank">https://mail.python.org/mailman/listinfo/tutor</a><br>
</font></span></blockquote></div><br></div>