[Tutor] Unicode trouble

Kent Johnson kent37 at tds.net
Wed Nov 30 19:41:54 CET 2005


Øyvind wrote:
> Øyvind wrote:
> 
>>>Where are you getting these errors (what line of the program)? Do you
>>>know what kind of strings objSelection.Find.Execute() is expecting?
>>
>>>Kent
>>
>>
>>>The program stops working and gives me these errors when I try to run it
>>>when it encounters a non-english letter.
>>
>>>This is the full error:
>>>Traceback (most recent call last):
>>>  File
>>>"C:\Python23\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py",
>>>line 310, in RunScript
>>>    exec codeObject in __main__.__dict__
>>>  File "C:\Python\BA\Oversett.py", line 47, in ?
>>>  File "C:\Python\BA\Oversett.py", line 23, in kjor
>>>    en = i.split('\t')[0]
>>>  File "C:\Python23\lib\codecs.py", line 388, in readlines
>>>    return self.reader.readlines(sizehint)
>>>  File "C:\Python23\lib\codecs.py", line 314, in readlines
>>>    return self.decode(data, self.errors)[0].splitlines(1)
>>>UnicodeDecodeError: 'utf8' codec can't decode bytes in position 168-170:
>>>invalid data
> 
> 
>>This is fairly strange as the line
>> en = i.split('\t')[0]
>>should not call any method in codecs. I don't know how you can get such a
>>stack trace.
> 
> The file f where en comes from does contain lots of lines with one english
> word followed by a tab and a norwegian one. (Approximately 25000 lines) It
> can look like this: core\tkjærne

Yes, I understand that.

> So en is supposed to be the english word that the program needs to find in
> MS Word, and to is the replacement word. So wouldn't that be a string that
> should be handled by codecs?
> 
>         for i in self.f.readlines():
>             en = i.split('\t')[0]

The thing is, it's the line
  for i in self.f.readlines():
that is calling the codecs module, not the line
  en = i.split('\t')[0]
but it is the latter line that is in the stack trace.
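
A minimal sketch of why the decode error belongs to readlines() and not to split(): with a codecs reader the bytes are decoded while the file is being read. This is written for current Python rather than 2.3, and the file name and sample word are made up for illustration.

```python
import codecs
import os
import tempfile

# Hypothetical sample: one "english<TAB>norwegian" line, deliberately
# written as latin-1 bytes (0xe6 is the latin-1 byte for the letter ae).
path = os.path.join(tempfile.mkdtemp(), "ordliste.txt")
with open(path, "wb") as out:
    out.write(b"core\tkj\xe6rne\n")

failed_in_readlines = False
f = codecs.open(path, "r", encoding="utf-8")
try:
    lines = f.readlines()          # decoding happens here
except UnicodeDecodeError:
    failed_in_readlines = True     # split() was never reached
else:
    words = [line.split("\t")[0] for line in lines]
finally:
    f.close()

print(failed_in_readlines)
```

By the time split() runs, the line is already a decoded string, so split() itself has no reason to touch the codec.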

Can any of the other tutors make any sense of this stack trace?
> 
>>The actual error indicates that the input data is not valid utf-8. Are
>>you sure that is the correct encoding for the input file? If the file is
>>utf-8 and has bad characters you could pass errors='ignore' or
>>errors='replace' as a parameter to codecs.open() to change the error
>>handling style to something more forgiving.
> 
> Is not valid utf-8? I have tried with latin-1 as well. To no avail. The
> letters that are the problem are æøå. They shouldn't be that exotic?

Not that exotic, no, but they have different representations in latin-1 and utf-8, and maybe other latin-x as well.
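
To make the difference concrete, here is a small sketch (current Python syntax) that encodes the three letters both ways. The letters are written as escapes so the example does not depend on the encoding of the source file itself.

```python
letters = u"\u00e6\u00f8\u00e5"   # the letters ae, o-slash, a-ring

latin1_bytes = letters.encode("latin-1")   # one byte per letter
utf8_bytes = letters.encode("utf-8")       # two bytes per letter

print(repr(latin1_bytes))
print(repr(utf8_bytes))
```

A file written in one of these encodings and read with the other will either fail to decode or come out as the wrong characters.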

If you don't know the encoding of your input file, you need to figure it out before you do anything else.
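
One rough way to narrow it down is to read the raw bytes and see which encodings decode them without error. The sketch below is only a heuristic, not a reliable detector: guess_encoding is a hypothetical helper, the candidate list is an assumption, and latin-1 has to come last because it accepts every possible byte.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "latin-1")):
    """Return the first candidate that decodes data without error, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None

# Made-up sample lines in the "english<TAB>norwegian" layout.
print(guess_encoding(b"core\tkj\xc3\xa6rne"))   # bytes that are valid utf-8
print(guess_encoding(b"core\tkj\xe6rne"))       # bytes that are not valid utf-8
```

A match only says the bytes are legal in that encoding, so it is still worth looking at the decoded text to check that the letters come out right.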
> 
> 
>>>Traceback (most recent call last):
>>>  File
>>>"C:\Python23\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py",
>>>line 310, in RunScript
>>>    exec codeObject in __main__.__dict__
>>>  File "C:\Python\BA\Oversett.py", line 49, in ?
>>>  File "C:\Python\BA\Oversett.py", line 33, in kjor
>>>    if t % 1000 == 0:
>>>UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 17:
>>>ordinal not in range(128)
> 
> 
>>Again this stack trace doesn't make sense, the indicated line doesn't do
>>any string operation.
> 
> 
>>This error message normally occurs when a non-ascii string is converted
>>to unicode using the default encoding (which is 'ascii'). Often the
>>conversion is implicit in some other operation but I don't see any such
>>operation here.
> 
> 
> But regardless, shouldn't 'ascii' be excluded here? Since I tell the
> program to change to utf-8, not only once but twice?

If the stack trace made sense I would have a better answer for you.
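
For reference, the 'ascii' message in the second traceback is what you get when latin-1 bytes are decoded as ASCII. Here is a small sketch in current Python, where the decode has to be spelled out (in Python 2 it can be triggered implicitly, for example by mixing byte strings and unicode strings). The sample bytes are made up.

```python
raw = b"bl\xe5"   # hypothetical latin-1 bytes; 0xe5 is the letter a-ring

message = None
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)

# The same bytes decode cleanly once the real encoding is named.
text = raw.decode("latin-1")
print(text == u"bl\u00e5")
```

Opening the file as utf-8 does not prevent this: any byte string that reaches such a conversion by another route is still decoded with the default encoding.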

>>>objSelection.Find.Execute() is supposed to accept any kind of string. (It
>>>is the function Search & Replace in MS Word).

I doubt it. This function has no way to distinguish utf-8 from latin-1 or latin-2 or EBCDIC or whatever. If you are giving it encoded strings, it has to make some assumption about the encoding. Or else it is expecting unicode strings.
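
If it does turn out to want unicode strings, the usual approach is to decode each line once, right after reading it, and pass only decoded text onward. A minimal sketch under those assumptions: split_pair is a hypothetical helper, latin-1 is just a guess at the file's encoding, each line is assumed to hold at least two tab-separated fields, and the call into Word is left out.

```python
def split_pair(raw_line, encoding):
    """Decode one 'english<TAB>norwegian' line and return both words as text."""
    text = raw_line.decode(encoding).rstrip("\r\n")
    en, to = text.split("\t")[:2]
    return en, to

# Made-up sample line, stored as latin-1 bytes.
en, to = split_pair(b"core\tkj\xe6rne\r\n", "latin-1")
print(repr(en))
print(repr(to))
```

Both values are then text rather than encoded bytes, so nothing downstream has to guess at an encoding.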

If this is the first time you have had to deal with different encodings you might want to read
http://www.joelonsoftware.com/articles/Unicode.html
or any of the other articles referenced at the end of this essay:
http://www.pycs.net/users/0000323/stories/14.html

Kent
-- 
http://www.kentsjohnson.com


