[Tutor] Word List

Emad Nawfal emadnawfal at gmail.com
Sun Mar 9 16:06:58 CET 2008

Dear Tutors,
I'm trying to find the most frequent words in an Arabic text. I wrote the
following code and tried it on English, where it works fine, but when I try it
on Arabic, all I get is backslashes and x's. I'm not familiar with Unicode.
Could somebody please tell me what's wrong here, and how I can get the
actual Arabic words?
Thank you in anticipation.

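import codecs

# Read the whole file as UTF-8 and split it into words on whitespace.
infile = codecs.open(r'C:\Documents and Settings\Emad\Desktop\milal.txt',
                     'r', 'utf-8').read().split()

# Count how many times each word occurs.
num = {}
for word in infile:
    if word not in num:
        num[word] = 0
    num[word] += 1

# Pair each count with its word.
new = zip(num.values(), num.keys())
outfile = codecs.open(r'C:\Documents and Settings\Emad\Desktop\milalwanihal.txt',
                      'w', 'utf-8')
# Write each (count, word) pair to the output file.
for word in new:
    print >> outfile, word
outfile.close()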
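
A minimal sketch of one way to get readable output, assuming Python 3 rather than the Python 2 syntax used above. The escapes most likely come from printing the whole (count, word) tuple, which shows an escaped representation of each word; writing the word itself as text should avoid that. The names count_words and write_counts, and the sample text, are hypothetical and only for illustration.

```python
import io
from collections import Counter


def count_words(text):
    """Return a Counter mapping each whitespace-separated word to its count."""
    return Counter(text.split())


def write_counts(counts, stream):
    """Write one 'word<TAB>count' line per word, most frequent first."""
    for word, count in counts.most_common():
        # Format the word itself, not a tuple, so no escape codes appear.
        stream.write("%s\t%d\n" % (word, count))


# Hypothetical sample text; a real script would read it with
# open(path, encoding="utf-8").read() instead.
sample = "سلام عليكم سلام"
counts = count_words(sample)

out = io.StringIO()  # stands in for open(out_path, "w", encoding="utf-8")
write_counts(counts, out)
result = out.getvalue()
```

Sorting with most_common() also puts the most frequent words at the top, which is the stated goal; the in-memory stream is used here only so the sketch runs without touching the real files.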

لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.....محمد
"No victim has ever been more repressed and alienated than the truth"

Emad Soliman Nawfal
Indiana University, Bloomington