<div dir="ltr"><div>Hi Steven, thanks very much for the detailed reply; it was very useful.<br>This is just a utility script that reads some sentiment analysis data and combines the positive and negative sentiments of multiple people into a single sentiment per line. The data came from a public-domain source that I have no control over. What worked was your suggestion to ignore the errors (I checked that my results are not affected when I choose to ignore them).<br></div>Thanks much.<br></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Oct 28, 2013 at 7:49 PM, Steven D'Aprano <span dir="ltr">&lt;<a href="mailto:steve@pearwood.info" target="_blank">steve@pearwood.info</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">On Mon, Oct 28, 2013 at 06:13:59PM -0400, SM wrote:<br>

> Hello,<br>

> I have an extremely simple piece of code which reads a .csv file, which has<br>

> 1000 lines of fixed fields, one line at a time, and tries to print some<br>

> values.<br>

><br>

>   1 #!/usr/bin/python3<br>

>   2 #<br>

>   3 import sys, time, re, os<br>

>   4<br>

>   5 if __name__=="__main__":<br>

>   6<br>

>   7     ifd = open("infile.csv", 'r')<br>

<br>

</div>By default Python 3 decodes text files using your platform's preferred<br>

encoding, which on your system is UTF-8. As the error below shows, your<br>

file actually isn't UTF-8.<br>
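<br>
As a minimal sketch of how this kind of error arises, the snippet below decodes some hypothetical sample bytes (not taken from your file) in which the byte 0xE9 is followed by an ordinary ASCII letter. The sample assumes a Latin-1 style source, which is only a guess:<br>

```python
# Hypothetical sample: "resume" with accented e's stored as the single byte 0xE9,
# as a Latin-1 style encoding would do. This is NOT data from the real file.
raw = b"r\xe9sum\xe9"

reason = None
bad_byte = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    # In UTF-8, 0xE9 announces a three-byte sequence, so the next byte
    # must be a continuation byte (0x80-0xBF). Here it is "s" instead.
    reason = err.reason
    bad_byte = raw[err.start]

print(reason, hex(bad_byte))
```

The reason reported here matches the one in the traceback, which suggests (but does not prove) a single-byte encoding in the real file.<br>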

<br>

What are you using to generate the CSV file? Consult the documentation<br>

for that program and see what it is using. If it has an option to save<br>

using UTF-8, use that.<br>

<br>

See below for more discussion.<br>

<div class="im"><br>

<br>

>   8<br>

>   9     linenum = 0<br>

>  10     for line in ifd:<br>

>  11         line1 = re.split(",", line)<br>

>  12         total = 0<br>

>  13         if linenum == 0:<br>

>  14             linenum = linenum + 1<br>

>  15             continue<br>

</div>[snip many more lines of code]<br>

<br>

All of this manual effort is unnecessary, as Python comes standard with<br>

a library to read CSV files. It is much better to use that:<br>

<br>

<a href="http://docs.python.org/3/library/csv.html" target="_blank">http://docs.python.org/3/library/csv.html</a><br>
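<br>
A rough sketch of reading the file with the csv module follows. The column names and rows are made up for illustration, since the real layout of infile.csv isn't shown, and an in-memory string stands in for the file:<br>

```python
import csv
import io

# Hypothetical stand-in for infile.csv; the real columns are unknown.
# With a real file you would use: open("infile.csv", newline="")
sample = io.StringIO('name,positive,negative\n"Smith, J",3,1\nLee,0,2\n')

reader = csv.reader(sample)
header = next(reader)            # skip the header row, as the original loop does
rows = [row for row in reader]   # each row is a list of strings

for row in rows:
    print(row)
```

Unlike splitting on "," by hand, the reader keeps a quoted field such as "Smith, J" together as one value.<br>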

<br>

>  31     ifd.close<br>

<br>

This line is buggy. To close the file, you need to *call* the close<br>

method by using parentheses, that is, you must write:<br>

<br>

ifd.close()<br>

<br>

<br>

Without the parentheses, you just get a reference to the close method<br>

but don't do anything with it.<br>
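<br>
To see the difference, here is a small sketch using a throwaway temporary file (the file and the variable names other than ifd are only for illustration). It also shows a with-block, which is a common way to avoid forgetting the call:<br>

```python
import os
import tempfile

# Create a throwaway file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)

ifd = open(path, "r")
ifd.close                     # only looks up the method; nothing is called
still_open = not ifd.closed   # the file is still open at this point
ifd.close()                   # actually closes the file
now_closed = ifd.closed

# A with-block closes the file automatically when the block ends.
with open(path, "r") as ifd2:
    ifd2.read()
closed_by_with = ifd2.closed

os.remove(path)
print(still_open, now_closed, closed_by_with)
```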

<div class="im"><br>

<br>

> It works fine till  it parses the 1st 126 lines in the input file. For the<br>

> 127th line (irrespective of the contents of the actual line), it prints the<br>

> following error:<br>

> Traceback (most recent call last):<br>

>   File "p1.py", line 10, in &lt;module&gt;<br>

>     for line in ifd:<br>

>   File "/usr/lib/python3.2/codecs.py", line 300, in decode<br>

>     (result, consumed) = self._buffer_decode(data, self.errors, final)<br>

> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1173:<br>

> invalid continuation byte<br>

> $<br>

><br>

> I am not able to figure out the cause of this error. Any clues as to why I<br>

> am seeing this error, are appreciated.<br>

<br>

</div>As mentioned earlier, the error is that the CSV file is not encoded<br>

using UTF-8. The best solution is to go back to the source where the file<br>

comes from and pick the option to always save using UTF-8.<br>

<br>

The second-best solution is to identify which codec is actually being used.<br>

If you tell us what program generates the CSV file in the first place,<br>

and the operating system you are using (Windows? Mac? Linux?), we might<br>

be able to identify the codec being used.<br>

<br>

If you can't identify the codec, you can guess. Guessing is bad, for two<br>

reasons:<br>

<br>

- you can waste a lot of time with bad guesses;<br>

<br>

- worse, some bad guesses won't give you an error, but will just<br>

  give you bad data.<br>
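<br>
The second point can be illustrated with a short sketch. The bytes and the list of candidate codecs below are hypothetical, chosen only to show that several guesses can succeed while disagreeing about the text:<br>

```python
# Hypothetical bytes; the real file's encoding is unknown.
raw = b"caf\xe9"

guesses = {}
for codec in ("utf-8", "latin-1", "cp1252", "cp1251"):
    try:
        guesses[codec] = raw.decode(codec)
    except UnicodeDecodeError:
        guesses[codec] = None   # this guess fails loudly

for codec, text in guesses.items():
    print(codec, repr(text))
```

Here the UTF-8 guess raises an error, the Latin-1 and cp1252 guesses agree with each other, and the cp1251 guess silently produces a Cyrillic letter in place of the accented one. Nothing in the program can tell you which of the successful results is the right one.<br>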

<br>

Nevertheless, you can try using a different encoding when you open the<br>

file. Try this:<br>

<br>

ifd = open("infile.csv", 'r', encoding='latin-1')<br>

<br>

"Latin 1" is an encoding which never fails to decode, because every byte<br>

value maps to some character, but it might give back rubbish data. Such<br>

rubbish data is often called "mojibake":<br>

<br>

<a href="http://en.wikipedia.org/wiki/Mojibake" target="_blank">en.wikipedia.org/wiki/Mojibake</a><br>
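<br>
A small sketch of how mojibake comes about, using a made-up word rather than anything from your file: text that really is UTF-8 gets decoded as Latin-1 without any error, but the result is wrong.<br>

```python
# Every possible byte value decodes under Latin-1, so it cannot raise an error.
all_bytes = bytes(range(256)).decode("latin-1")

text = "caf\u00e9"                        # "cafe" with an acute accent on the e
utf8_bytes = text.encode("utf-8")         # the accented e becomes two bytes
garbled = utf8_bytes.decode("latin-1")    # no error, but two wrong characters
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled), repr(repaired))
```

The round trip in the last step only works because Latin-1 maps bytes to characters one-to-one; it is a way to undo a known wrong guess, not a way to find the right encoding.<br>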

<br>

Another option is to cover up the errors by passing an error handler:<br>

<br>

ifd = open("infile.csv", 'r', errors='replace')<br>

<br>

which will replace any undecodable bytes in the file with a "missing<br>

character".<br>
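<br>
As a rough sketch of what the error handlers do, the snippet below writes a hypothetical two-line CSV containing one stray 0xE9 byte to a temporary file, then reads it back with errors='replace' and with errors='ignore'. The encoding is passed explicitly here so the result does not depend on the platform default:<br>

```python
import os
import tempfile

# Hypothetical file contents with one byte (0xE9) that is not valid UTF-8.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as out:
    out.write(b"name,comment\nAnn,caf\xe9 au lait\n")

# 'replace' substitutes U+FFFD (the replacement character) for the bad byte.
with open(path, "r", encoding="utf-8", errors="replace") as ifd:
    replaced = ifd.read()

# 'ignore' silently drops the bad byte instead.
with open(path, "r", encoding="utf-8", errors="ignore") as ifd:
    ignored = ifd.read()

os.remove(path)
print(repr(replaced))
print(repr(ignored))
```

Either handler loses information, so it is worth checking that the affected fields do not matter for your results.<br>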

<span class="HOEnZb"><font color="#888888"><br>

<br>

--<br>

Steven<br>

_______________________________________________<br>

Tutor maillist  -  <a href="mailto:Tutor@python.org">Tutor@python.org</a><br>


</font></span></blockquote></div><br></div>