encoding error in python 2.7

Sun Feb 24 03:34:57 EST 2013

Hala Gamal wrote:

> Thank you :) It worked well for a small file, but when I use a big file I
> get this error:
>
> Traceback (most recent call last):
>   File "D:\Python27\yarab (4).py", line 46, in <module>
>     writer.add_document(**doc)
>   File "build\bdist.win32\egg\whoosh\filedb\filewriting.py", line 369, in add_document
>     items = field.index(value)
>   File "build\bdist.win32\egg\whoosh\fields.py", line 466, in index
>     return [(txt, 1, 1.0, '') for txt in self._tiers(num)]
>   File "build\bdist.win32\egg\whoosh\fields.py", line 454, in _tiers
>     yield self.to_text(num, shift=shift)
>   File "build\bdist.win32\egg\whoosh\fields.py", line 487, in to_text
>     return self._to_text(self.prepare_number(x), shift=shift,
>   File "build\bdist.win32\egg\whoosh\fields.py", line 476, in prepare_number
>     x = self.type(x)
> UnicodeEncodeError: 'decimal' codec can't encode characters in position 0-4: invalid decimal Unicode string
>
> I don't really know where the problem is.
>
> On Friday, February 22, 2013 4:55:22 PM UTC+2, Hala Gamal wrote:
>> My code works well with an English file, but when I use a text file
>> encoded as UTF-8 that contains some Arabic letters, it doesn't work.

I guess that one of the fields you declared as NUMERIC contains non-digit
characters: the traceback ends in prepare_number(), where self.type(x) fails
to convert the value to a number. Replace the line

>>       writer.add_document(**doc)

with something similar to

         try:
             writer.add_document(**doc)
         except UnicodeEncodeError:
             print "Skipping malformed line", repr(i) 

This will let you inspect the lines your script cannot handle. If they are
indeed "malformed", as I am guessing, you can fix your input data.
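
If you would rather find out which field is the culprit before handing the
record to whoosh, a rough sketch like the following might help. It assumes
your NUMERIC fields hold integers; the function name and the numeric_fields
list are made up for illustration and need to be adapted to your schema.
Catching ValueError is enough because UnicodeEncodeError is a subclass of it.

```python
def non_numeric_fields(doc, numeric_fields):
    """Return the names of fields in doc whose values int() cannot parse.

    numeric_fields is a hypothetical list of the field names declared
    NUMERIC in the schema; every listed name must be present in doc.
    """
    bad = []
    for name in numeric_fields:
        try:
            int(doc[name])
        except ValueError:
            # UnicodeEncodeError is a subclass of ValueError, so this also
            # covers the "invalid decimal Unicode string" case.
            bad.append(name)
    return bad


# Made-up record: "year" holds Arabic letters instead of digits.
doc = {u"id": u"12", u"year": u"\u0639\u0631\u0628\u064a", u"title": u"abc"}
print(non_numeric_fields(doc, [u"id", u"year"]))
```

Printing the offending field names together with repr() of the line should
make it obvious whether the data or the parsing is at fault.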

i is a terrible name for a line in a file, btw. Also, you should avoid
readlines(), which reads the whole file into memory, and instead iterate over
the file object directly:

import codecs

with codecs.open("tt.txt", encoding='utf-8-sig') as textfile:
    for line in textfile: # no readlines(), can handle
                          # text files of arbitrary size
        ...
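
A slightly fuller sketch of the same idea, in case it is useful: it pairs
every line with its line number so that a "Skipping malformed line" message
can say where the problem is. I am using io.open() here instead of
codecs.open(); both accept the encoding argument. The function name and the
throwaway demo file are made up for illustration.

```python
import io
import os
import tempfile


def numbered_lines(path):
    """Yield (line_number, text) pairs without loading the whole file."""
    # 'utf-8-sig' decodes UTF-8 and drops a leading byte order mark if present.
    with io.open(path, encoding="utf-8-sig") as textfile:
        for number, line in enumerate(textfile, 1):
            yield number, line.rstrip(u"\r\n")


# Small demonstration with a temporary file that starts with a BOM.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(u"\ufeff12\n\u0639\u0631\u0628\u064a\n".encode("utf-8"))

demo_lines = list(numbered_lines(demo_path))
os.remove(demo_path)
```

Because the generator hands out one line at a time, memory use stays flat
no matter how big the input file gets.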