encoding error in python 27
Peter Otten
__peter__ at web.de
Sun Feb 24 03:34:57 EST 2013
Hala Gamal wrote:
> thank you :)it worked well for small file but when i enter big file,, i
> obtain this error: "Traceback (most recent call last):
> File "D:\Python27\yarab (4).py", line 46, in <module>
> writer.add_document(**doc)
> File "build\bdist.win32\egg\whoosh\filedb\filewriting.py", line 369, in
> add_document
> items = field.index(value)
> File "build\bdist.win32\egg\whoosh\fields.py", line 466, in index
> return [(txt, 1, 1.0, '') for txt in self._tiers(num)]
> File "build\bdist.win32\egg\whoosh\fields.py", line 454, in _tiers
> yield self.to_text(num, shift=shift)
> File "build\bdist.win32\egg\whoosh\fields.py", line 487, in to_text
> return self._to_text(self.prepare_number(x), shift=shift,
> File "build\bdist.win32\egg\whoosh\fields.py", line 476, in
> prepare_number
> x = self.type(x)
> UnicodeEncodeError: 'decimal' codec can't encode characters in position
> 0-4: invalid decimal Unicode string" i don't know realy where is the
> problem?
> On Friday, February 22, 2013 4:55:22 PM UTC+2, Hala Gamal wrote:
>> my code works well with english file but when i use text file
>> encodede"utf-8" "my file contain some arabic letters" it doesn't work.
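The traceback ends in x = self.type(x), so the conversion of a field value to
a number is what fails. I guess that one of the fields you require to be
NUMERIC contains non-digit characters. Replace the line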
>> writer.add_document(**doc)
with something similar to
try:
    writer.add_document(**doc)
except UnicodeEncodeError:
    print "Skipping malformed line", repr(i)
This will allow you to inspect the lines your script cannot handle. If they
are indeed "malformed", as I am guessing, you can fix your input data.
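If you would rather find the offending values before they reach the writer, a check along these lines might help. It is only a sketch: find_bad_numbers, the "id" key and the sample records are made up for illustration, and it assumes the NUMERIC field is meant to hold plain integers.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def find_bad_numbers(records, fieldname):
    """Return (index, value) pairs whose value cannot be converted by int().

    fieldname is a hypothetical key; use the name of your NUMERIC field.
    """
    bad = []
    for index, record in enumerate(records):
        value = record[fieldname]
        try:
            int(value)
        except ValueError:
            # UnicodeEncodeError is a subclass of ValueError, so this also
            # catches the 'decimal' codec error seen in the traceback.
            bad.append((index, value))
    return bad


# Made-up sample data: two of the three "id" values are not numbers.
records = [
    {"id": u"123"},
    {"id": u"\u0645\u0631\u062d\u0628\u0627"},  # Arabic letters, no digits
    {"id": u"45x"},
]
for index, value in find_bad_numbers(records, "id"):
    print("record", index, "has a non-numeric id:", repr(value))
```

Catching ValueError rather than UnicodeEncodeError alone means values like "45x" are reported as well.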
i is a terrible name for a line in a file, btw. Also, you should avoid
readlines(), which reads the whole file into memory, and instead iterate over
the file object directly:
with codecs.open("tt.txt", encoding='utf-8-sig') as textfile:
    for line in textfile: # no readlines(), can handle
                          # text files of arbitrary size
        ...
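Combined with enumerate() you also get a line number to report when a line is skipped. The following is a rough, self-contained sketch rather than a drop-in replacement: read_lines and the throwaway sample file are invented for the example, and io.open stands in for codecs.open (it takes the same encoding argument).

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

import io
import os
import tempfile


def read_lines(path):
    """Yield (line_number, text) pairs, reading one line at a time."""
    # 'utf-8-sig' drops a leading byte order mark if the file has one.
    with io.open(path, encoding="utf-8-sig") as textfile:
        for line_number, line in enumerate(textfile, 1):
            yield line_number, line.rstrip(u"\r\n")


# Build a small throwaway file (written with a BOM) so the example
# runs on its own; in real use you would pass your own file name.
handle, sample_path = tempfile.mkstemp(suffix=".txt")
os.close(handle)
with io.open(sample_path, "w", encoding="utf-8-sig") as out:
    out.write(u"1 first\n2 \u062b\u0627\u0646\u064a\n")

sample_lines = list(read_lines(sample_path))
os.remove(sample_path)

for line_number, line in sample_lines:
    print(line_number, repr(line))
```

Because read_lines is a generator, only one line is held in memory at a time while the file is being processed.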