Dear Tutors,<br>I'm trying to get the most frequent words in an Arabic text. I wrote the following code and tried it on English and it works fine, but when I try it on Arabic, all I get is the slashes and x's. I'm not familiar with Unicode. Could somebody please tell me what's wrong here, and how I can get the actual Arabic words?<br>
Thank you in anticipation<br><br><br>import codecs<br>
<br>
# Read the whole file as Unicode and split it into words on whitespace.<br>
infile = codecs.open(r'C:\Documents and Settings\Emad\Desktop\milal.txt', 'r', 'utf-8').read().split()<br>
num = {}<br>
for word in infile:<br>
    if word not in num:<br>
        num[word] = 1<br>
    else:<br>
        num[word] += 1<br>
# Sort the (count, word) pairs from most to least frequent.<br>
new = zip(num.values(), num.keys())<br>
new.sort()<br>
new.reverse()<br>
outfile = codecs.open(r'C:\Documents and Settings\Emad\Desktop\milalwanihal.txt', 'w', 'utf-8')<br>
for word in new:<br>
    print >> outfile, word<br>
outfile.close()<br><br clear="all"><br>-- <br>لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.....محمد الغزالي<br>"No victim has ever been more repressed and alienated than the truth"<br>
<br>Emad Soliman Nawfal<br>Indiana University, Bloomington<br><a href="http://emnawfal.googlepages.com">http://emnawfal.googlepages.com</a><br>--------------------------------------------------------
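A likely cause of the escaped output, offered as a guess rather than a confirmed diagnosis: the loop prints each (count, word) tuple as a whole, and in Python 2 the string form of a tuple uses the repr of its items, which escapes non-ASCII characters. Formatting the word itself should avoid that. The sketch below is one possible way to do it. The function names count_words and write_counts are hypothetical, and it assumes a Python version that provides collections.Counter (2.7 or later), which may not match the version used above.

```python
# -*- coding: utf-8 -*-
import io
from collections import Counter


def count_words(text):
    """Return (word, count) pairs, most frequent first (hypothetical helper)."""
    return Counter(text.split()).most_common()


def write_counts(pairs, path):
    """Write one 'count<TAB>word' line per pair, encoded as UTF-8."""
    with io.open(path, 'w', encoding='utf-8') as out:
        for word, count in pairs:
            # Format the word itself, not the tuple that contains it,
            # so no repr-style escapes end up in the file.
            out.write(u'%d\t%s\n' % (count, word))
```

To use it, read the input with io.open(path, 'r', encoding='utf-8').read(), pass the text to count_words, and pass the result to write_counts together with the output path. The output file should then be opened in an editor that is set to UTF-8.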