[Tutor] UnicodeDecodeError

Sun Mar 15 14:50:53 EDT 2020

On 15/03/2020 18:45, David L Neil via Tutor wrote:
> On 16/03/20 4:41 AM, thehouse.be--- via Tutor wrote:
>> I am a beginner, learning Python.
>> So sorry if my question is basic.
>>
>> I am trying to work with .csv files in order to analyse data which 
>> comes from a Google Forms survey.
>> The idea is to process the raw data, do some statistical analysis and 
>> produce a report.
>>
>> When trying to convert the data into a list of lists, I get a 
>> UnicodeDecodeError.
>>
>> This is what I do:
>>
>> >>> import csv
>> >>> exampleFile = open('example.csv')
>> >>> exampleReader = csv.reader(exampleFile)
>> >>> exampleData = list(exampleReader)
>>
>> This last statement generates:
>> ---------------------------------------------------------------------------
>> UnicodeDecodeError Traceback (most recent call last)
>> <ipython-input-9-3817c0931c6f> in <module>
>> ----> 1 exampleData = list(exampleReader)
>> /Applications/mu-editor.app/Contents/Resources/python/lib/python3.6/encodings/ascii.pyc 
>> in decode(self, input, final)
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 
>> 1798: ordinal not in range(128)
>>
>> I suppose there is a bizarre character somewhere in the file, but no 
>> idea where.
>> As we use accents and umlauts in our language, could that be the problem?
>> If so, how can I solve it?
> 
> 
> NB there are major differences in this area between Python2 and Python3. 
> I'm assuming you are using Python3.
> 
> 
> The difficulty, as you say, is that the majority of the world's 
> population use languages which cannot be adequately expressed using 
> ASCII (*American* Standard Code...) - which also makes this type of 
> question difficult to answer because of the many permutations and 
> combinations...
> 
> If the spreadsheet/original .CSV file was built using MS-Excel and/or on 
> a non-English-speaking MS-Windows machine, then it is highly likely we 
> need to harmonise this Python code with that characteristic.
> 
> Are you able to ascertain such detail? If not, you can probably make an 
> educated guess (given your analysis to date).
> 
> 
> Microsoft Windows tends to put European users into one of the 
> Windows-125x code pages, which are close relatives of the ISO 8859-x 
> character sets. (But which one? Good news: we may not need to be 
> *exactly* correct in this choice!)
> 
> Python3 works with Unicode by default.
> 
> It is possible to encode and decode between "text encodings". Some 
> experimentation may be necessary.
> 
> Please let us know the results of your investigation/experiments, and/or 
> if that leads to further questions...
> 
> 
> WebRefs:
> https://en.wikipedia.org/wiki/ISO/IEC_8859-1
> https://docs.python.org/3/howto/unicode.html
> https://docs.python.org/3/library/codecs.html
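
One observation on the traceback: byte 0xc3 is the first byte of the two-byte UTF-8 sequence for most accented Latin letters (é, ë, ü and so on), so the file may well be UTF-8 while this Python is defaulting to ASCII. That is a guess from the traceback, not something the thread confirms. A minimal sketch under that assumption follows; the sample rows and the temporary file are invented purely for illustration.

```python
import csv
import os
import tempfile

# Hypothetical sample data standing in for the Google Forms export.
rows = [["name", "city"], ["Zoë", "Liège"], ["Jürgen", "Köln"]]

# Write the sample as UTF-8, where "ë" becomes the two bytes 0xc3 0xab.
path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)

# Reading with the ASCII codec reproduces the kind of error in the traceback.
try:
    with open(path, encoding="ascii", newline="") as f:
        list(csv.reader(f))
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Naming the encoding explicitly lets the same file decode cleanly.
with open(path, encoding="utf-8", newline="") as f:
    example_data = list(csv.reader(f))
```

If the guess is right, changing the original call to open('example.csv', encoding='utf-8', newline='') should be enough. If that raises its own UnicodeDecodeError, the file is in some other encoding and 'cp1252' or 'latin-1' would be the next candidates to try.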

This https://pypi.org/project/chardet/ might also come in handy.
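
A rough sketch of how it might be used, assuming chardet has been installed with pip. The helper name guess_encoding, the UTF-8-first check and the 'cp1252' fallback are my own choices for illustration and are not part of chardet itself; only chardet.detect() is the library's call.

```python
try:
    import chardet  # third-party package: pip install chardet
except ImportError:
    chardet = None  # the sketch still runs, minus the statistical guess


def guess_encoding(raw, fallback="cp1252"):
    """Return a plausible encoding name for the bytes in raw."""
    # Bytes that decode as strict UTF-8 are rarely anything else,
    # so check that first.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Otherwise ask chardet, if available, for its best guess.
    if chardet is not None:
        guess = chardet.detect(raw)["encoding"]
        if guess:
            return guess
    # Assumed default for Western European files; adjust as needed.
    return fallback


# Hypothetical samples: the same text in two different encodings.
sample_utf8 = "Zoë,Liège\n".encode("utf-8")
sample_legacy = "Zoë,Liège\n".encode("cp1252")

utf8_guess = guess_encoding(sample_utf8)
legacy_guess = guess_encoding(sample_legacy)
```

For a real file, read it in binary mode first, for example raw = open('example.csv', 'rb').read(), and pass the result of guess_encoding(raw) as the encoding argument when reopening the file in text mode. Any detector is only making an educated guess, especially on short samples, so it is worth eyeballing the decoded text afterwards.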

-- 
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence