[Tutor] read in text file containing non-English characters

Sat Jan 14 01:33:09 CET 2012

On 13/01/12 08:20, Francis P. Boscoe wrote:
>
>
> Given a simple text file of departments, capitals, longitude and latitude
> separated by commas
>
> Ahuachapán,Ahuachapán,-89.8450,13.9190
> Cabañas,Sensuntepeque,-88.6300,13.8800
> Cuscatlán,Cojutepeque,-88.9333,13.7167
>
> I would like to know how to read in the file and then access arbitrary
> rows in the file, so that I can print a line such as:
>
> The capital of Cabañas is Sensuntepeque
>
> while preserving the non-English characters

What version of Python are you using? This is likely to be easier in Python 
3.1 or 3.2; if you can upgrade to either of those, that will make your life 
easier in the long run.

First off, you need to know what encoding the source file uses. You call it a 
"simple text file", but there is no such thing once you include non-ASCII 
values! The truth is, there never was such a thing, but so long as people only 
included ASCII characters in files, we could ignore the complexity.

If you don't understand what I mean by "encoding", I strongly recommend you 
read Joel On Software:

http://www.joelonsoftware.com/articles/Unicode.html

If you don't know what the encoding is, you have to guess, or give up. 
Possibly ask the supplier of the file. There are software libraries which will 
read a text file and try to guess the encoding for you. Or when in doubt, just 
try UTF-8.
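
If you do end up guessing, the "try UTF-8 first" approach might look roughly 
like this. It is only a sketch, written in Python 3 syntax; the function name 
decode_with_fallback and the list of candidate encodings are my own choices 
for illustration, not anything from a library. Note that latin-1 accepts every 
possible byte, so it has to go last, and a decode that succeeds does not prove 
the guess was right.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    'raw' is the file content as bytes. The candidate list is an assumption:
    UTF-8 first because it is strict, latin-1 last because it never fails.
    """
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # this guess was wrong, try the next one
    raise ValueError("none of the candidate encodings fit")


# The same line stored under two different encodings.
as_utf8 = "Cabañas,Sensuntepeque,-88.6300,13.8800\n".encode("utf-8")
as_cp1252 = "Cabañas,Sensuntepeque,-88.6300,13.8800\n".encode("cp1252")

text_a, used_a = decode_with_fallback(as_utf8)
text_b, used_b = decode_with_fallback(as_cp1252)
print(used_a, used_b)
```

Both calls recover the same text here, but only because the right encoding 
happened to be in the list. For real files, asking the supplier is still the 
better option.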

I have created a text file containing the three lines above, starting with 
Ahuachapán. Because I have created it, I know that the encoding I used was 
UTF-8. (If you are creating your own data files, *always* use UTF-8 unless you 
have a specific reason why you shouldn't.) But I'm going to pretend that I 
don't know this, and show you what happens when I get the encoding wrong.

This is using Python 2.6.

py> import codecs
py> for line in codecs.open('test.txt', encoding='latin1'):
...     print line.strip()  # strip() removes the trailing newline
...
AhuachapÃ¡n,AhuachapÃ¡n,-89.8450,13.9190
CabaÃ±as,Sensuntepeque,-88.6300,13.8800
CuscatlÃ¡n,Cojutepeque,-88.9333,13.7167

So I got the encoding wrong. Incorrect characters like Ã± are often known 
by the Japanese term "mojibake", and they are a telltale sign of encoding 
problems. Here's another wrong guess:

py> for line in codecs.open('test.txt', encoding='ascii'):
...     print line.strip()
...
Traceback (most recent call last):
   [ ... traceback deleted for brevity ... ]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: 
ordinal not in range(128)

So when you get the encoding wrong, two things may happen: you get an error, 
telling you you got it wrong, or you get junk output, which if you are really 
unlucky might look like legitimate output.
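
If you want to see both outcomes without a file, here is a small sketch 
working on bytes directly. It uses Python 3 syntax (where str is Unicode text 
and bytes is raw data), and the variable names are just mine. The last step 
shows that junk from a wrong latin-1 guess can be undone, but only because 
latin-1 maps every byte to exactly one character.

```python
original = "Cabañas"
raw = original.encode("utf-8")     # the bytes as they sit on disk

# Wrong guess 1: latin-1 never raises, it silently produces junk.
garbled = raw.decode("latin-1")    # garbled is now 'CabaÃ±as'

# Wrong guess 2: ASCII cannot represent byte 0xc3, so it raises.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Undo the latin-1 mistake: back to the original bytes, then decode properly.
repaired = garbled.encode("latin-1").decode("utf-8")

print(len(original), len(garbled), ascii_failed, repaired == original)
```

The garbled string is one character longer than the original because the 
two-byte UTF-8 sequence for ñ was read as two separate latin-1 characters.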

This is what happens when I use the correct encoding:

py> for line in codecs.open('test.txt', encoding='utf8'):
...     print line.strip()
...
Ahuachapán,Ahuachapán,-89.8450,13.9190
Cabañas,Sensuntepeque,-88.6300,13.8800
Cuscatlán,Cojutepeque,-88.9333,13.7167

It just works perfectly.

Now, this is how to actually do some useful work with the data. Assuming you 
are using at least version 2.6 (namedtuple was added in 2.6, so this won't 
work in 2.5 or older) this should work nicely:

py> from collections import namedtuple
py> # Field names must be in the same order as the columns in the file.
py> Record = namedtuple('Record', 'region capital x y')
py> data = []
py> for line in codecs.open('test.txt', encoding='utf8'):
...     line = line.strip()
...     data.append(Record(*line.split(',')))
...
py> for record in data:
...     print "The capital of", record.region, "is", record.capital
...
The capital of Ahuachapán is Ahuachapán
The capital of Cabañas is Sensuntepeque
The capital of Cuscatlán is Cojutepeque
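
You also asked about getting at arbitrary rows. One way, sketched here in 
Python 3 syntax, is to store the records in a dict keyed by department name, 
so you can look one up directly. The names Row and load_rows are made up for 
this example, and the script writes its own temporary copy of your three 
sample lines so that it can run stand-alone; with your real file you would 
just pass its path.

```python
import csv
import io
import os
import tempfile
from collections import namedtuple

# Field order matches the columns: department, capital, longitude, latitude.
Row = namedtuple("Row", "department capital longitude latitude")


def load_rows(path, encoding="utf-8"):
    """Read the comma-separated file into a dict keyed by department name."""
    rows = {}
    with io.open(path, encoding=encoding, newline="") as handle:
        for fields in csv.reader(handle):
            if fields:  # skip blank lines
                row = Row(*fields)
                rows[row.department] = row
    return rows


# Throwaway copy of the sample data, written as UTF-8.
sample = (
    "Ahuachapán,Ahuachapán,-89.8450,13.9190\n"
    "Cabañas,Sensuntepeque,-88.6300,13.8800\n"
    "Cuscatlán,Cojutepeque,-88.9333,13.7167\n"
)
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(sample)

rows = load_rows(path)
os.remove(path)  # clean up the temporary file

row = rows["Cabañas"]
sentence = "The capital of {0} is {1}".format(row.department, row.capital)
print(sentence)
```

The longitude and latitude are still strings at this point; wrap them in 
float() if you need to do arithmetic with them.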

Hope this helps and gets you started.

Regards,

-- 
Steven