[Tutor] Word List

Emad Nawfal emadnawfal at gmail.com
Sun Mar 9 22:59:35 CET 2008


2008/3/9 Kent Johnson <kent37 at tds.net>:

> Emad Nawfal wrote:
> > Dear Tutors,
> > I'm trying to get the most frequent words in an Arabic text. I wrote the
> > following code and tried it on English and it works fine, but when I try
> > it on Arabic, all I get is the slashes and x's.
>
> > import codecs
> > infile = codecs.open(r'C:\Documents and Settings\Emad\Desktop\milal.txt',
> >                      'r', 'utf-8').read().split()
> > num = {}
> > for word in infile:
> >     if word not in num:
> >         num[word] = 0
> >     num[word] += 1
> > new = zip(num.values(), num.keys())
>
> Note that new is a list of pairs of (count, word), *not* a list of words.
>
> > new.sort()
> > new.reverse()
> > out = codecs.open(r'C:\Documents and Settings\Emad\Desktop\milalwanihal.txt',
> >                   'w', 'utf-8')
> > for word in new:
> >     print >> out, word
>
> So here 'word' is a tuple, not a string.
>
> When you print a tuple, the output uses the repr() of each element,
> not the str(). For byte strings in Python 2, this means that
> non-ASCII characters are always shown as backslash escapes.
>
> For example:
> In [19]: s='é'
> In [21]: print s
> é
> In [25]: t=(s,s)
> In [26]: print t
> ('\xc3\xa9', '\xc3\xa9')
>
> I suggest you format the output yourself. If you want the tuple
> formatting, try this:
>
> for count, word in new: # unpack the tuple to two values
>   out.write('(%s, %s)\n' % (count, word))
>
> Kent
>
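
The repr()-versus-str() point above can be checked with a small sketch.
This is written for Python 3, not the Python 2 used in the thread, so it
is only an illustration under that assumption. In Python 3 the default
repr() keeps printable non-ASCII characters, so ascii() is used here to
show the escaped form; the variable names are made up for the example.

```python
# Python 3 sketch: a tuple's string form uses repr() of each element.
pair = (3, '\u00e9')  # '\u00e9' is the letter e with an acute accent

as_tuple = str(pair)         # repr() of each element: the word keeps its quotes
as_text = '(%s, %s)' % pair  # str() of each element: no quotes around the word
escaped = ascii(pair)        # like repr(), but non-ASCII becomes backslash escapes

print(escaped)
```

Formatting the values yourself, as in as_text, is what keeps the word
readable in the output.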

Thank you so much, Kent. It works. I have now realized the drawbacks of
self-learning.
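
For reference, a minimal sketch of the same word count in Python 3,
assuming collections.Counter is acceptable in place of the hand-written
dictionary. The sample text, the helper names, and the output file name
are placeholders for illustration, not the files from this thread.

```python
import collections
import io
import os
import tempfile


def most_frequent(text):
    """Return (word, count) pairs, most frequent first."""
    counts = collections.Counter(text.split())
    return counts.most_common()


def format_counts(pairs):
    """Format each pair by hand so words are written as text, not as repr()."""
    return ''.join('%d\t%s\n' % (count, word) for word, count in pairs)


sample = 'كتاب قلم كتاب'  # placeholder sample text
pairs = most_frequent(sample)
report = format_counts(pairs)

# Placeholder output path in the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), 'word_counts.txt')
with io.open(out_path, 'w', encoding='utf-8') as out:
    out.write(report)
```

Opening the file with an explicit encoding and writing pre-formatted
lines avoids the backslash escapes that appear when a tuple is printed.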

-- 
لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.....محمد
الغزالي
"No victim has ever been more repressed and alienated than the truth"

Emad Soliman Nawfal
Indiana University, Bloomington
http://emnawfal.googlepages.com
--------------------------------------------------------