Changing the default text codec
Fuzzyman
michael at foord.net
Mon Feb 23 10:21:29 EST 2004
Paul Prescod <paul at prescod.net> wrote in message news:<mailman.193.1077530419.27104.python-list at python.org>...
> Fuzzyman wrote:
> > Sorry if my terminology is wrong..... but I'm having intermittent
> > problems dealing with accented characters in python. (Only from the 8
> > bit latin-1 character set I think..)
>
> I would say that if you get a 100% failure rate in IDLE and a 100%
> success rate from a console program then your problem is not
> intermittent but environment specific.
If that was the case then I'm sure you'd be right... good not to
quibble about terminology eh ;-)
(in a few other test cases the success-fail pattern was the opposite
way round)
>
> > For example - if I run my program from IDLE and give it the word
> > 'degri' (containing e-acute) then I get the error :
>
> What do you mean by "give it the word"? Through raw_input()? Through a file?
>
Right - it is fetching the words from a Tkinter entry box using the
get() method.
> However you are getting this information, it seems to me that in IDLE
> you are getting a Unicode object rather than an 8-bit string object.
> Convert it to an 8-bit string:
>
> mydata.encode("latin-1")
Great - that might do the job.
I'll try it.
Thanks.
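
A minimal sketch of the suggested conversion, written for Python 3 (an assumption; the thread concerns Python 2, where the Unicode object is `unicode` and the 8-bit string is `str`). The helper name `to_latin1` is hypothetical. It also shows what happens when a character has no Latin-1 equivalent.

```python
def to_latin1(text, errors="strict"):
    """Encode a Unicode string to Latin-1 bytes (hypothetical helper)."""
    return text.encode("latin-1", errors)


# 'degri' with an e-acute in second position.
word = "d\u00e9gri"
encoded = to_latin1(word)

# The euro sign is not in Latin-1; with errors="replace" it becomes "?".
fallback = to_latin1("\u20ac5", errors="replace")

print(encoded)
print(fallback)
```

With the default `errors="strict"`, a character outside Latin-1 raises `UnicodeEncodeError` instead of being replaced, which is usually the safer choice while debugging.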
>
> > if letter in self.valid_letters:
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position
> > 26: ordinal not in range(128)
>
> Something looks suspicious here. I wouldn't expect self.valid_letters to
> have a 0x83 character in it because I would expect it to be hard-coded
> to ASCII in your program like:
>
self.valid_letters is *in fact* string.lowercase - which I thought
included the 8-bit latin-1 letters as well. (The letters are converted
to lowercase using the .lower() string method.)
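
In Python 2 the value of string.lowercase depends on the locale, so it may or may not contain accented letters on a given machine. One way to avoid that, sketched here in Python 3 under the assumption that "valid" means a-z plus the lowercase Latin-1 letters, is to build the set explicitly. The names `LATIN1_LOWER`, `VALID_LETTERS` and `all_valid` are hypothetical.

```python
import string

# Assumption: the valid letters are ASCII a-z plus the lowercase Latin-1
# letters U+00DF..U+00FF, skipping U+00F7 (the division sign).
LATIN1_LOWER = "".join(chr(c) for c in range(0xDF, 0x100) if c != 0xF7)
VALID_LETTERS = string.ascii_lowercase + LATIN1_LOWER


def all_valid(word):
    """Return True if every letter of the lowercased word is allowed."""
    return all(letter in VALID_LETTERS for letter in word.lower())


print(all_valid("D\u00c9GRI"))
print(all_valid("d3gri"))
```

Because the table is a Unicode string, the membership test never needs an implicit conversion between byte strings and Unicode.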
> valid_letters = "abcdefghijklmnopqrstuvwxyzABCDEF..."
>
> On the other hand I wouldn't expect "letter" to have more than one
> character so how could it have a problem at position 26?
>
I'm iterating over the string.
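
One possible reading of "position 26", offered as an assumption rather than something confirmed in this thread: in Python 2, testing a Unicode letter against an 8-bit string makes Python decode the 8-bit string as ASCII first. If valid_letters holds 26 ASCII letters followed by locale-specific bytes, the first undecodable byte sits at index 26. The Python 3 sketch below reproduces that decode step explicitly; the trailing byte values are hypothetical stand-ins.

```python
import string

# Hypothetical stand-in for a locale-extended lowercase table: 26 ASCII
# letters followed by some non-ASCII bytes (the exact values are assumed).
valid_letters = string.ascii_lowercase.encode("ascii") + b"\x83\x9a\x9c"

try:
    # Python 2 performs this decode implicitly when a unicode object is
    # compared with an 8-bit string; here it is spelled out.
    valid_letters.decode("ascii")
    failure = None
except UnicodeDecodeError as err:
    failure = err

print(failure)
```

On this reading the error comes from the table of valid letters, not from the word being checked.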
> > What I'd like to do is switch by default to an 8 bit codec (latin-1 I
> > think ?????) and then offer the user the choice of either mapping the
> > accented characters to their nearest equivalent (e-acute to e for
> > example) *or* treating them as separate characters.............
>
> Why change the default codec rather than explicitly using the codec you
> care about? If you want to work in the 8-bit world rather than the
> Unicode world, just use the "encode" function on the Unicode object. If
> you want to work in the Unicode world.
>
Great - sounds good.
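
The two options mentioned above (map e-acute to e, or keep it as a separate character) could be sketched roughly as follows in Python 3, using the standard unicodedata module. The function names and the `fold_accents` flag are hypothetical, and this is only one way of doing it.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalise(word, fold_accents=True):
    """Lowercase a word, optionally mapping accented letters to base letters."""
    word = word.lower()
    return strip_accents(word) if fold_accents else word


print(normalise("D\u00e9gri"))
print(normalise("D\u00e9gri", fold_accents=False))
```

Letters with no canonical decomposition, such as sharp s or o-slash, pass through unchanged, so a complete "nearest equivalent" mapping would need an explicit table for those.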
> > I can't work out how to change the default codec (no matter what the
> > locale) ?
>
> I'd advise against fixing the problem in that way. Convert data
> appropriately when you bring it from the outside world into the Python
> program and ignore the default codec.
>
> Paul Prescod
Thanks for your help.
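
For the record, the "convert at the boundary" advice might look something like this in Python 3. It is a sketch only: the helper names are hypothetical, and Latin-1 is assumed to be the encoding of the outside data.

```python
def read_word(raw, encoding="latin-1"):
    """Decode incoming bytes once, at the edge of the program."""
    return raw.decode(encoding) if isinstance(raw, bytes) else raw


def write_word(word, encoding="latin-1"):
    """Encode outgoing text once, when it leaves the program."""
    return word.encode(encoding)


incoming = b"d\xe9gri"             # bytes as they might arrive from a file
word = read_word(incoming)         # everything in between works on Unicode
outgoing = write_word(word.upper())

print(word)
print(outgoing)
```

With the conversions kept at the edges, the default codec never comes into play.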
Fuzzyman
http://www.voidspace.org.uk/atlantibots/pythonutils.html