[Tutor] input file encoding
Kent Johnson
kent37 at tds.net
Tue Sep 11 12:50:11 CEST 2007
Tim Golden wrote:
> Tim Michelsen wrote:
>> Hello,
>> I want to process some files encoded in latin-1 (iso-8859-1) in my
>> python script that I write on Ubuntu which has UTF-8 as standard encoding.
>
> Not sure what you mean by "standard encoding" (is this an Ubuntu
> thing?)
Probably referring to the encoding the terminal application expects;
writing latin-1 characters when the terminal expects utf-8 will not work well.
Python also has a default encoding, but that is ascii unless you change
it yourself.
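To see which encodings are in play on a given machine, something like the following can be run. It is only a sketch: the variable names are made up for illustration, and the values depend on the Python version and the terminal (Python 2 reports ascii as the default encoding unless it has been changed, while Python 3 reports utf-8).

```python
import sys

# Encoding Python falls back on for implicit string conversions.
# Python 2 reports 'ascii' unless it was changed; Python 3 reports 'utf-8'.
default_encoding = sys.getdefaultencoding()

# Encoding the terminal attached to stdout expects. This can be None
# (or missing) when output is piped or stdout has been replaced.
terminal_encoding = getattr(sys.stdout, "encoding", None)

print("default encoding: %s" % default_encoding)
print("stdout encoding: %s" % terminal_encoding)
```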
> In this case, assuming you have files in iso-8859-1, something
> like this:
>
> <code>
> import codecs
>
> filenames = ['a.txt', 'b.txt', 'c.txt']
> for filename in filenames:
>     f = codecs.open (filename, encoding="iso-8859-1")
>     text = f.read ()
>     #
>     # If you want to re-encode this -- not sure why --
This is needed to put the text into the proper encoding for the
terminal. If you print a unicode string directly, Python encodes it
with sys.stdout.encoding when that is set and otherwise with the system
default encoding (ascii). In the latter case, anything outside ASCII fails:
In [13]: print u'\xe2'
------------------------------------------------------------
Traceback (most recent call last):
  File "<ipython console>", line 1, in <module>
<type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe2' in position 0: ordinal not in range(128)
In [14]: print u'\xe2'.encode('utf-8')
â
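The same difference can be reproduced without involving the terminal at all by calling encode() explicitly. This is a small sketch, not part of the original session; the names text, ascii_ok and utf8_bytes are chosen here for illustration.

```python
text = u'\xe2'  # LATIN SMALL LETTER A WITH CIRCUMFLEX

# Encoding to ascii fails because the code point is above 127.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to utf-8 succeeds and produces a two-byte sequence.
utf8_bytes = text.encode('utf-8')

print(ascii_ok)
print(len(utf8_bytes))
```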
>     # you could do this:
>     # text = text.encode ("utf-8")
>     print repr (text)
No, not repr; that will print the string with backslash escapes and quotes:
In [15]: print repr(u'\xe2'.encode('utf-8'))
'\xc3\xa2'
And he may not want to rebind text itself to a utf-8 byte string. Just
encode at the point of output:
print text.encode('utf-8')
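Putting the pieces together, a minimal sketch of the whole round trip might look like this. The helper read_latin1 and the temporary sample file are assumptions made so the example is self-contained; they are not from the original posts. The text stays unicode inside the program and is encoded only at the point of output.

```python
import codecs
import os
import tempfile


def read_latin1(path):
    """Return the contents of a latin-1 encoded file as a unicode string."""
    f = codecs.open(path, encoding="iso-8859-1")
    try:
        return f.read()
    finally:
        f.close()


# Create a small latin-1 sample file (hypothetical data for this sketch).
fd, sample_path = tempfile.mkstemp(suffix=".txt")
os.write(fd, b"caf\xe9\n")  # "caf" followed by e-acute, as latin-1 bytes
os.close(fd)

text = read_latin1(sample_path)   # unicode string, decoded from latin-1
encoded = text.encode("utf-8")    # byte string suitable for a utf-8 terminal
os.remove(sample_path)
```

On Python 2, print encoded would then send the utf-8 bytes to the terminal, while text itself remains unicode for any further processing.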
Kent