Ascii to Unicode.

Ethan Furman ethan at stoneleaf.us
Thu Jul 29 14:14:24 EDT 2010


Joe Goldthwaite wrote:
> Hi Steven,
> 
> I read through the article you referenced.  I understand Unicode better now.
> I wasn't completely ignorant of the subject.  My confusion is more about how
> Python is handling Unicode than Unicode itself.  I guess I'm fighting my own
> misconceptions. I do that a lot.  It's hard for me to understand how things
> work when they don't function the way I *think* they should.
> 
> Here's the main source of my confusion.  In my original sample, I had read a
> line in from the file and used the unicode function to create a
> unicodestring object;
> 
> 	unicodestring = unicode(line, 'latin1')
> 
> What I thought this step would do is translate the line to an internal
> Unicode representation.  The problem character \xe1 would have been
> translated into a correct Unicode representation for the accented "a"
> character. 

Correct.  At this point you have a unicode string.
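
A minimal sketch of that step, in case it helps to see it.  The sample
bytes are made up for illustration; `line.decode('latin1')` is the same
operation as `unicode(line, 'latin1')` in Python 2 and also runs on
Python 3.

```python
import unicodedata

# Hypothetical sample line, as raw bytes read from a latin1 file.
line = b'caf\xe1'

# Decode the bytes into a unicode (text) string.
unicodestring = line.decode('latin1')

last = unicodestring[-1]
code_point = ord(last)            # an integer code point, not a byte
name = unicodedata.name(last)     # the character's official Unicode name
print('%d %s' % (code_point, name))
```

The result holds four characters, and the last one is the accented "a"
identified by its code point rather than by any particular byte.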

> Next I tried to write the unicodestring object to a file thusly;
> 
> 	output.write(unicodestring)
> 
> I would have expected the write function to request the byte string from the
> unicodestring object and simply write that byte string to a file.  I thought
> that at this point, I should have had a valid Unicode latin1 encoded file.
> Instead I get an error that the character \xe1 is invalid.

Here's the problem: a unicode string has no single byte string that
represents it; the two are completely different things.  There are dozens
of possible encodings for turning unicode into a byte string (UTF-8 is
just one of them), and each can produce different bytes.
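
To make that concrete, here is a small sketch that encodes the same
one-character unicode string three ways.  The variable names are only
illustrative.

```python
accented_a = u'\xe1'   # one character: U+00E1, the accented "a"

# The same character becomes different bytes under different encodings.
as_latin1 = accented_a.encode('latin1')     # one byte
as_utf8 = accented_a.encode('utf-8')        # two bytes
as_utf16 = accented_a.encode('utf-16-le')   # two bytes, different ones

print(repr(as_latin1))
print(repr(as_utf8))
print(repr(as_utf16))
```

None of these three is "the" byte string for the character, which is why
you have to pick an encoding yourself when writing it out.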

> The fact that the \xe1 character is still in the unicodestring object tells
> me it wasn't translated into whatever python uses for its internal Unicode
> representation.  Either that or the unicodestring object returns the
> original string when it's asked for a byte stream representation.

Wrong.  It was translated.  It so happens that the first 256 unicode code
points have the same numbers as the latin1 byte values (and the first 128
match ascii), so the accented "a" is code point 0xE1 and its repr shows
\xe1.  When you attempt to write a unicode string without specifying which
encoding you want, Python 2 falls back to ascii (not latin1), so any
character outside the 0-127 range is going to raise an error.
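
A rough sketch of both points.  It encodes to ascii explicitly, which is
what Python 2 does implicitly when a unicode string is handed to a plain
file's write(); the flag names are just for illustration.

```python
# Decode every possible byte value as latin1 and compare the numbers.
all_bytes = bytearray(range(256))
decoded = all_bytes.decode('latin1')
same_numbers = [ord(ch) for ch in decoded] == list(range(256))

# Encoding the accented "a" as ascii fails: it is outside 0-127.
try:
    u'\xe1'.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(same_numbers)
print(ascii_failed)
```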

> Instead of just writing the unicodestring object, I had to do this;
> 
> 	output.write(unicodestring.encode('utf-8'))
> 
> This is doing what I thought the other steps were doing.  It's translating
> the internal unicodestring byte representation to utf-8 and writing it out.
> It still seems strange and I'm still not completely clear as to what is
> going on at the byte stream level for each of these steps.


Don't think of unicode as a byte stream.  It's a bunch of numbers (code
points) that map to a bunch of symbols.  The byte stream only comes into
play when you want to send unicode somewhere (file, socket, etc.), and at
that point you have to encode the unicode into bytes.
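
One way to keep the encoding step in a single place is to open the file
with an explicit encoding, so write() accepts unicode directly.  This is
only a sketch: the temporary path and sample text are made up, and
io.open is used because it behaves the same on Python 2.6+ and Python 3.

```python
import io
import os
import tempfile

unicodestring = u'caf\xe1\n'   # hypothetical sample text
path = os.path.join(tempfile.mkdtemp(), 'out.txt')

# The file object encodes to UTF-8 as the text is written.
# newline='' keeps '\n' from being translated on any platform.
with io.open(path, 'w', encoding='utf-8', newline='') as output:
    output.write(unicodestring)

# Read the file back in binary mode to see the bytes on disk.
with io.open(path, 'rb') as raw:
    data = raw.read()

print(repr(data))
```

Calling output.write(unicodestring.encode('utf-8')) on a file opened in
binary mode puts the same bytes on disk; the explicit encoding argument
just saves you from repeating the encode() call.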

Hope this helps!

~Ethan~


