[Tutor] Word List
Emad Nawfal
emadnawfal at gmail.com
Sun Mar 9 16:06:58 CET 2008
Dear Tutors,
I'm trying to get the most frequent words in an Arabic text. I wrote the
following code and tried it on English, where it works fine, but when I try
it on Arabic, all I get is backslashes and x's instead of the words. I'm not
familiar with Unicode. Could somebody please tell me what's wrong here, and
how I can get the actual Arabic words?
Thank you in anticipation.
import codecs

# Read the whole file as Unicode text and split it on whitespace.
infile = codecs.open(r'C:\Documents and Settings\Emad\Desktop\milal.txt',
                     'r', 'utf-8').read().split()

# Count how many times each word occurs.
num = {}
for word in infile:
    if word not in num:
        num[word] = 1
    else:
        num[word] += 1

# Sort the (count, word) pairs so the most frequent words come first.
new = zip(num.values(), num.keys())
new.sort()
new.reverse()

outfile = codecs.open(r'C:\Documents and Settings\Emad\Desktop\milalwanihal.txt',
                      'w', 'utf-8')
for word in new:
    print >> outfile, word
outfile.close()
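
One likely cause, offered as a guess rather than a confirmed diagnosis: each
item written to the output file is a (count, word) tuple, and printing a tuple
shows the repr of its elements, which for non-ASCII text in Python 2 is an
escaped form rather than the characters themselves. The sketch below is
written for Python 3, not the Python 2 used in the post, and formats the word
itself before writing it. The names most_frequent_words and write_counts and
the sample string are hypothetical, chosen only for illustration.

```python
from collections import Counter


def most_frequent_words(text):
    """Return (word, count) pairs, most frequent first."""
    return Counter(text.split()).most_common()


def write_counts(pairs, path):
    """Write one 'count<TAB>word' line per pair as UTF-8 text."""
    with open(path, 'w', encoding='utf-8') as out:
        for word, count in pairs:
            # Format the word itself; writing the whole tuple would
            # store the tuple's repr instead of readable text.
            out.write('%d\t%s\n' % (count, word))


# Hypothetical sample text: three tokens, one of them repeated.
sample = 'كتاب قلم كتاب'
pairs = most_frequent_words(sample)
```

Calling write_counts(pairs, some_path) would then produce a UTF-8 file with
one count and one word per line; the file must also be opened as UTF-8 in the
viewer or editor for the Arabic to display correctly.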
--
لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.....محمد
الغزالي
"No victim has ever been more repressed and alienated than the truth"
Emad Soliman Nawfal
Indiana University, Bloomington
http://emnawfal.googlepages.com