[Tutor] Encoding error when reading text files in Python 3

Dat Huynh htdatcse at gmail.com
Sat Jul 28 12:45:47 CEST 2012

I changed my code and it runs on Python 3 now.

          f = open(rootdir+file, 'rb')
          data = f.read().decode('utf8', 'ignore')

Thank you very much.

On Sat, Jul 28, 2012 at 6:09 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> Dat Huynh wrote:
>> Dear all,
>> I have written a simple application in Python to read data from text
>> files.
>> Currently I have both Python version 2.7.2 and Python 3.2.3 on my laptop.
>> I don't know why it does not run on Python version 3 while it runs
>> well on Python 2.
> Python 2 is more forgiving of beginner errors when dealing with text and
> bytes, but makes it harder to deal with text correctly.
> Python 3 makes it easier to deal with text correctly, but is less forgiving.
> When you read from a file in Python 2, it will give you *something*, even if
> it is the wrong thing. It will not give a decoding error, even if the data
> you are reading is not valid text. It will just give you junk bytes, which
> show up as garbled characters, sometimes known as mojibake.
> Python 3 no longer does that. It tells you when there is a problem, so you
> can fix it.
>> Could you please tell me how I can run it on python 3?
>> Following is my Python code.
>>  ------------------------------
>>    for subdir, dirs, files in os.walk(rootdir):
>>         for file in files:
>>             print("Processing [" +file +"]...\n" )
>>             f = open(rootdir+file, 'r')
>>             data = f.read()
>>             f.close()
>>             print(data)
>> ------------------------------
>> This is the error message:
> [...]
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position
>> 4980: ordinal not in range(128)
> This tells you that you are reading a non-ASCII file but haven't told Python
> what encoding to use, so Python falls back to the default encoding for your
> platform, which here is ASCII.
> Do you know what encoding the file is?
> Do you understand about Unicode text and bytes? If not, I suggest you read
> this article:
> http://www.joelonsoftware.com/articles/Unicode.html
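
A minimal sketch of the bytes-versus-text distinction, reproducing the same kind of error as in the traceback above without needing a file. The sample bytes and the helper name try_decode are made up for illustration; only the byte 0xd1 is taken from the error message.

```python
# Hypothetical sample data: ASCII bytes with one non-ASCII byte (0xd1) inside.
raw = b"abc \xd1 def"


def try_decode(data, encoding):
    """Decode bytes to text, returning a short message instead of raising."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "failed: {}".format(exc)


# ASCII only covers bytes 0-127, so 0xd1 cannot be decoded.
ascii_result = try_decode(raw, "ascii")

# cp866 (one possible guess, not necessarily the real encoding of the file)
# maps every byte to some character, so the same data decodes without error.
cp866_result = try_decode(raw, "cp866")

print(ascii_result)
print(repr(cp866_result))
```

The same bytes are either an error or readable text depending only on which encoding is assumed, which is why knowing the encoding of the file matters.
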
> In Python 3, you can either tell Python what encoding to use:
> f = open(rootdir+file, 'r', encoding='utf8')  # for example
> or you can set an error handler:
> f = open(rootdir+file, 'r', errors='ignore')  # for example
> or both:
> f = open(rootdir+file, 'r', encoding='ascii', errors='replace')
> You can see the list of encodings and error handlers here:
> http://docs.python.org/py3k/library/codecs.html
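
The three variants above can be compared side by side with a small sketch. It assumes Python 3 and uses a temporary file with made-up sample text, so none of the names or contents come from the original application.

```python
import os
import tempfile

# Hypothetical sample: the Russian word "Privet" in Cyrillic, then ASCII text.
sample = "\u041f\u0440\u0438\u0432\u0435\u0442, world"

# Write the sample to a temporary file as UTF-8 bytes.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(sample.encode("utf8"))

# 1. Correct encoding: the text comes back unchanged.
with open(path, "r", encoding="utf8") as f:
    as_utf8 = f.read()

# 2. Wrong encoding with errors='replace': every undecodable byte
#    becomes U+FFFD, the Unicode replacement character.
with open(path, "r", encoding="ascii", errors="replace") as f:
    replaced = f.read()

# 3. Wrong encoding with errors='ignore': undecodable bytes are
#    dropped silently, so data is lost without any warning.
with open(path, "r", encoding="ascii", errors="ignore") as f:
    ignored = f.read()

os.remove(path)

print(repr(as_utf8))
print(repr(replaced))
print(repr(ignored))
```

Note that an error handler only hides the problem. If the encoding is known, passing it explicitly is the option that preserves the text.
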
> Unfortunately, Python 2 does not support this using the built-in open
> function. You have to use codecs.open instead of the built-in open,
> like this:
> import codecs
> f = codecs.open(rootdir+file, 'r', encoding='utf8')  # for example
> which fortunately works in both Python 2 and 3.
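
As an alternative that is not mentioned in the thread, io.open also accepts an encoding argument on Python 2.6 and later, and on Python 3 it is the same function as the built-in open. A minimal sketch, using a temporary file and made-up sample text:

```python
import io
import os
import tempfile

# Hypothetical sample text containing one non-ASCII character (e-acute).
sample = u"caf\u00e9"

# Create an empty temporary file to write into.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

# io.open takes the same encoding/errors arguments as Python 3's open().
with io.open(path, "w", encoding="utf8") as f:
    f.write(sample)

with io.open(path, "r", encoding="utf8") as f:
    text = f.read()

os.remove(path)

print(repr(text))
```

Like codecs.open, this returns text (unicode) rather than bytes on both Python versions.
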
> Or you can read the file in binary mode, and then decode it into text:
> f = open(rootdir+file, 'rb')
> data = f.read()
> f.close()
> text = data.decode('cp866', 'replace')
> print(text)
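
The binary-mode approach can be combined with the directory walk from the original question. This is only a sketch under stated assumptions: the function name read_all is invented here, cp866 is just the example encoding used above, and the path is built with os.path.join(subdir, name) instead of rootdir+file so that files in subdirectories are found too.

```python
import os
import shutil
import tempfile


def read_all(rootdir, encoding="utf8", errors="replace"):
    """Return {path: decoded text} for every file below rootdir."""
    texts = {}
    for subdir, dirs, files in os.walk(rootdir):
        for name in files:
            # Join with the directory currently being walked, not rootdir.
            path = os.path.join(subdir, name)
            # Read raw bytes; the with-block closes the file for us.
            with open(path, "rb") as f:
                data = f.read()
            # Turn bytes into text using the chosen encoding and handler.
            texts[path] = data.decode(encoding, errors)
    return texts


# Demonstration with a throwaway directory and a made-up cp866 file.
sample = "\u041f\u0440\u0438\u0432\u0435\u0442"  # "Privet" in Cyrillic
rootdir = tempfile.mkdtemp()
os.mkdir(os.path.join(rootdir, "sub"))
sample_path = os.path.join(rootdir, "sub", "sample.txt")
with open(sample_path, "wb") as f:
    f.write(sample.encode("cp866"))

texts = read_all(rootdir, encoding="cp866")
shutil.rmtree(rootdir)

for path, text in texts.items():
    print("Processing [" + path + "]...")
    print(repr(text))
```
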
> If you don't know the encoding, you can try opening the file in Firefox or
> Internet Explorer and see if they can guess it, or you can use the chardet
> library in Python.
> http://pypi.python.org/pypi/chardet
> Or if you don't care about getting mojibake, you can pretend that the file
> is encoded using Latin-1. That will pretty much read anything, although what
> it gives you may be junk.
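
A short sketch of why Latin-1 reads anything: it assigns a character to every one of the 256 possible byte values, so decoding cannot fail. The sample word is made up for illustration.

```python
# Every possible byte value, 0 through 255.
every_byte = bytes(bytearray(range(256)))

# Latin-1 maps byte N to code point N, so this never raises.
decoded = every_byte.decode("latin-1")

# The price: UTF-8 data read as Latin-1 turns into mojibake.
utf8_bytes = "caf\u00e9".encode("utf8")   # the e-acute becomes two bytes
mojibake = utf8_bytes.decode("latin-1")   # two junk characters instead of one

# Because the mapping is one-to-one, the original bytes are still
# recoverable if the real encoding is discovered later.
recovered = mojibake.encode("latin-1").decode("utf8")

print(len(decoded))
print(repr(mojibake))
print(repr(recovered))
```

So nothing is lost at the byte level, but the text shown to the user is wrong until it is re-decoded with the correct encoding.
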
> --
> Steven