[Tutor] Word List

Sun Mar 9 21:35:10 CET 2008

Emad Nawfal wrote:
> Dear Tutors,
> I'm trying to get the most frequent words in an Arabic text. I wrote the 
> following code and tried it on English and it works fine, but when I try 
> it on Arabic, all I get is the slashes and x's.

> import codecs
> infile = codecs.open(r'C:\Documents and Settings\Emad\Desktop\milal.txt', 'r', 'utf-8').read().split()
> num = {}
> for word in infile:
>     if word not in num:
>         num[word] = 1
>     num[word] +=1
> new = zip(num.values(), num.keys())

Note that new is a list of pairs of (count, word), *not* a list of words.

> new.sort()
> new.reverse()
> outfile = codecs.open(r'C:\Documents and Settings\Emad\Desktop\milalwanihal.txt', 'w', 'utf-8')
> for word in new:
>         print >> out, word

So here 'word' is a tuple, not a string.

When you print a tuple, the output uses the repr() of each element of 
the tuple, not the str(). For strings, this means that non-ASCII 
characters are always printed using backslash escapes.

For example:
In [19]: s='é'
In [21]: print s
é
In [25]: t=(s,s)
In [26]: print t
('\xc3\xa9', '\xc3\xa9')
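
The same check can be run as a script. This is only a small sketch, and 
it assumes Python 3, whereas the session above is Python 2. In Python 3, 
repr() leaves printable non-ASCII characters alone, so the built-in 
ascii() is used here to reproduce the Python 2 style escapes.

```python
# Python 3 sketch; the interactive session above is Python 2.
s = 'é'
t = (s, s)

as_str = str(s)      # the bare character, no quotes
tuple_text = str(t)  # str() of a tuple is built from repr() of each element
escaped = ascii(t)   # like repr(), but non-ASCII is backslash-escaped

print(escaped)
```

The escape here is \xe9 rather than \xc3\xa9 because a Python 3 str 
holds the code point, not the UTF-8 bytes.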

I suggest you format the output yourself. If you want the tuple 
formatting, try this:

for count, word in new:  # unpack the tuple into two values
    outfile.write('(%s, %s)\n' % (count, word))
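
For comparison, here is a rough sketch of the whole count-and-write step 
with the formatting done by hand. It assumes Python 3. The helper name 
write_counts and the sample words are made up for illustration, and an 
in-memory io.StringIO stands in for the real output file.

```python
import io
from collections import Counter

def write_counts(text, out):
    """Write one '(count, word)' line per distinct word, most frequent first."""
    counts = Counter(text.split())
    # Build (count, word) pairs and sort them in descending order,
    # which matches sort() followed by reverse() on the list of pairs.
    pairs = sorted(((n, w) for w, n in counts.items()), reverse=True)
    for count, word in pairs:
        # Format each value with %s so the word is written as text,
        # not as the repr() of a tuple element.
        out.write('(%s, %s)\n' % (count, word))

# Hypothetical sample data: two Arabic words, one of them repeated.
kitab = 'كتاب'
qalam = 'قلم'
buf = io.StringIO()
write_counts(' '.join([kitab, qalam, kitab]), buf)
result = buf.getvalue()
```

To write to a real file in Python 3, a file object from 
open(path, 'w', encoding='utf-8') could be passed in place of buf.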

Kent