[Tutor] read in text file containing non-English characters

Martin A. Brown martin at linux-ip.net
Fri Jan 13 04:18:34 CET 2012


Greetings Francis,

You have entered the Unicode or multiple character set zone.  This 
is the deep end of the pool, and even experienced practitioners have 
difficulty here.  Fortunately, Python eases the burden on you, but 
this still requires some care.

 : Given a simple text file of departments, capitals, longitude and 
 : latitude separated by commas

A side note: if you are dealing with geographic hierarchies, you may 
wish to consider using a publicly available 'standard' hierarchy.  

Since you are mailing from a .state.ny.us address, I might guess 
that you are working in a context in which you do not have 
control over the source of the geographic hierarchical data.  Still, 
I'll point out the following:

  * GeoNames:  http://www.geonames.org/
  * GNIS: http://en.wikipedia.org/wiki/Geographic_Names_Information_System

Apologies to somebody with actual brains who said this, but:  The 
great thing about standards is that you have so many to choose 
from.

OK, so that's unrelated to your direct question.  You have a few 
capitals and latlongs that you want to read from a comma-separated 
file.

 : Ahuachapán,Ahuachapán,-89.8450,13.9190
 : Cabañas,Sensuntepeque,-88.6300,13.8800
 : Cuscatlán,Cojutepeque,-88.9333,13.7167
 : 
 : I would like to know how to read in the file and then access 
 : arbitrary rows in the file, so that I can print a line such as:
 : 
 : The capital of Cabañas is Sensuntepeque
 : 
 : while preserving the non-English characters
 : 
 : now, for example, I get
 : 
 : Cabañas

You don't show even a snippet of code.  If you are asking 
for help here, it is good form to show us your code.  Since 
you don't state how you are reading the data and how you are 
printing the data, we can't help much.  Here are some tips:

  * Consider learning how to use the csv module, particularly in 
    your case, csv.reader (as Ramit Prasad has already suggested).

  * Consider checking the bytestream to see if the bytes produced
    on output are the same as on input (also, read the text that 
    Mark Tompkins indicated and learn to distinguish Unicode from 
    UTF-8).

  * Report back to the list the version of Python you are using.
    [Different versions of Python have subtly different handling of
    non-ASCII character data, but that is probably not the cause of 
    the more obvious problem you are showing above.]
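
To make the Unicode versus UTF-8 distinction concrete, here is a 
small sketch.  It assumes your file is UTF-8 encoded (your message 
does not say), and the variable names are mine, for illustration 
only.  It should run on Python 2.6+ as well as Python 3.3+.

```python
# Unicode text: a sequence of characters, independent of any encoding.
text = u'Caba\u00f1as'

# The same text as bytes, in two different encodings.
utf8_bytes = text.encode('utf-8')      # n-tilde becomes two bytes: 0xC3 0xB1
latin1_bytes = text.encode('latin-1')  # n-tilde becomes one byte: 0xF1

print(len(text))          # number of characters
print(len(utf8_bytes))    # number of bytes in UTF-8
print(len(latin1_bytes))  # number of bytes in Latin-1

# Decoding with the right codec gives back the original text.
round_trip = utf8_bytes.decode('utf-8')

# Decoding UTF-8 bytes with the wrong codec does not fail; it silently
# yields different characters (one per byte).
garbled = utf8_bytes.decode('latin-1')

print(repr(round_trip))
print(repr(garbled))
```

If your output has an extra character where the accented letter 
should be, a mismatch like the last one (UTF-8 bytes interpreted as 
a single-byte encoding) is a likely suspect, though I can't confirm 
that without seeing your code.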

We have no idea what your ultimate goal is with the data, but we 
can help you much more if you show us the code.

Here's a sample of what I would/could do (Python 2.6.5):

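    import csv
    # Python 2: the csv module wants the file opened in binary mode
    reader = csv.reader(open('input-data.txt', 'rb'), delimiter=',')
    for row in reader:
        # row[0] is the department, row[1] its capital
        print 'The capital of %s is %s' % (row[0], row[1])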

The above is trivial, but if you would like some more substantive 
assistance, you should describe your problem in a bit more detail.
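
Since you asked about accessing arbitrary rows, here is a rough 
sketch of one way to do it.  Note the assumptions: this version is 
written for Python 3 (not the 2.6.5 I used above), it assumes the 
file is UTF-8 encoded, and load_rows and capital_sentence are names 
I made up for illustration.  So that it runs on its own, it first 
writes your three sample rows to a temporary file.

```python
import csv
import io
import os
import tempfile

# The three sample rows from the question, as Unicode text.
SAMPLE = (
    u'Ahuachap\u00e1n,Ahuachap\u00e1n,-89.8450,13.9190\n'
    u'Caba\u00f1as,Sensuntepeque,-88.6300,13.8800\n'
    u'Cuscatl\u00e1n,Cojutepeque,-88.9333,13.7167\n'
)


def load_rows(path, encoding='utf-8'):
    """Read the whole file into a list of rows (each a list of strings)."""
    # Decode bytes to text while reading; newline='' is what csv expects.
    with io.open(path, encoding=encoding, newline='') as handle:
        return list(csv.reader(handle, delimiter=','))


def capital_sentence(row):
    """Build the sentence from the department and capital columns."""
    return u'The capital of %s is %s' % (row[0], row[1])


# Demo setup: write the sample data as UTF-8 to a temporary file.
path = os.path.join(tempfile.mkdtemp(), 'input-data.txt')
with io.open(path, 'w', encoding='utf-8', newline='') as handle:
    handle.write(SAMPLE)

rows = load_rows(path)

# Any row can now be reached by index; rows[1] is the second line.
sentence = capital_sentence(rows[1])
print(sentence)
```

Reading everything into a list is fine for a small file like yours.  
Whether the accented characters display correctly still depends on 
the encoding your terminal expects.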

-Martin

-- 
Martin A. Brown
http://linux-ip.net/

