[Tutor] Word List
Kent Johnson
kent37 at tds.net
Sun Mar 9 21:35:10 CET 2008
Emad Nawfal wrote:
> Dear Tutors,
> I'm trying to get the most frequent words in an Arabic text. I wrote the
> following code and tried it on English and it works fine, but when I try
> it on Arabic, all I get is the slashes and x's.
> import codecs
> infile = codecs.open(r'C:\Documents and
> Settings\Emad\Desktop\milal.txt', 'r', 'utf-8').read().split()
> num = {}
> for word in infile:
>     if word not in num:
>         num[word] = 1
>     num[word] +=1
> new = zip(num.values(), num.keys())

Note that new is a list of (count, word) pairs, *not* a list of words.

> new.sort()
> new.reverse()
> outfile = codecs.open(r'C:\Documents and
> Settings\Emad\Desktop\milalwanihal.txt', 'w', 'utf-8')
> for word in new:
>     print >> out, word

So here 'word' is a tuple, not a string.

When you print a tuple, the output uses the repr() of each element of
the tuple, not the str() of each element. For strings, this means that
non-ASCII characters are always printed using backslash escapes.
For example:
In [19]: s='é'
In [21]: print s
é
In [25]: t=(s,s)
In [26]: print t
('\xc3\xa9', '\xc3\xa9')
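
The same rule can be checked outside an interactive session. The sketch
below assumes Python 3, not the Python 2 used in the session above. In
Python 3 the repr() of a text string keeps printable non-ASCII characters,
so the backslash escapes only show up once the string is encoded to UTF-8
bytes, which is roughly what a Python 2 byte string holds. The variable
names are illustrative only.

```python
# Python 3 sketch (assumption: the session above is Python 2).
s = 'é'
t = (s, s)

element_str = str(s)   # str() of the string is the character itself
tuple_str = str(t)     # str() of a tuple is built from repr() of each element

# The tuple's text matches repr() of the elements, quotes included.
assert tuple_str == "(" + repr(s) + ", " + repr(s) + ")"

# Encoding to UTF-8 gives the two bytes seen in the Python 2 session;
# the repr() of a bytes object shows them as backslash escapes.
raw = s.encode('utf-8')
raw_tuple_str = str((raw, raw))
print(raw_tuple_str)
```

Either way, the container decides how its elements are shown, which is why
formatting each element yourself avoids the escapes.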
I suggest you format the output yourself. If you want the tuple
formatting, try this:
for count, word in new: # unpack the tuple to two values
    out.write('(%s, %s)\n' % (count, word))
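
For reference, here is a minimal end-to-end sketch of the same idea,
assuming Python 3 rather than the Python 2 of this thread. The names
count_words and write_counts are hypothetical helpers, not part of the
original code, and an in-memory io.StringIO stands in for the output file
so the example runs without touching the disk. It uses collections.Counter
for the counting and breaks ties alphabetically, which differs slightly
from the sort() plus reverse() in the quoted code.

```python
import io
from collections import Counter


def count_words(text):
    """Return (count, word) pairs, most frequent first."""
    counts = Counter(text.split())
    pairs = [(n, w) for w, n in counts.items()]
    # Highest count first; ties are ordered by word so the result is stable.
    return sorted(pairs, key=lambda p: (-p[0], p[1]))


def write_counts(pairs, out):
    """Write one '(count, word)' line per pair, formatting each element."""
    for count, word in pairs:  # unpack the tuple to two values
        out.write('(%s, %s)\n' % (count, word))


# Small Arabic sample (illustrative data, not from the original file).
sample = 'كتاب قلم كتاب ورقة كتاب قلم'

buffer = io.StringIO()  # stands in for a file opened with encoding='utf-8'
write_counts(count_words(sample), buffer)
result = buffer.getvalue()
```

With a real file in Python 3, the buffer would be replaced by something like
open(path, 'w', encoding='utf-8'), where path is whatever output file you
choose.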
Kent