[Tutor] unicode files

dman dman@dman.ddts.net
Fri, 3 May 2002 11:31:15 -0500


On Fri, May 03, 2002 at 03:30:41PM +0100, steve king wrote:
| Hello,
|
| I am relatively new to Python so don't laugh if what I ask is painfully
| obvious... (or just plain wrong!!!)
|
| I have worked out how to open a file and how to write to it, but I need the
| file to be in unicode. I know there is a unicode() function but there isn't
| a great deal of documentation on how to use it.
|
| Is it a case of creating the file, writing to it and then saving it as
| unicode, or would I be better off writing the strings as unicode and then
| saving the file?

At the programmer level there is no such thing as "saving a file".
You can write data to a file, and that's "all".  The term "save as"
only really applies to user interfaces where the user must pick which
branch of the program to have write the data to the file.

| At the moment I have something like this:
|
| import sys
|
| starting_dir=raw_input("Dir and file please: ")
| somename=starting_dir
| file=open(somename, 'w')
| file.write("The first line of a file")
| file.close()

This works as long as the bytes in that string literal are exactly
what you want in the file.  The only difference when using unicode,
for example, is converting that sequence of bytes/characters to the
sequence of bytes that you want.

Since unicode characters are abstract code points rather than bytes,
various serialization schemes (encodings) exist to allow putting the
characters into a byte stream (file).  Some of those encodings are
UCS-4, UCS-2, UTF-16, UTF-7 and UTF-8.  UTF-8 is the most widely known
and has many advantages over the other encodings (it is compatible
with US-ASCII and has no byte-order issues).  If you want to
read/write UTF-8 encoded data the following code will do it :

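# "w+" creates the file if needed; "r+" fails if it doesn't exist yet
f = open( "the_file" , "w+" )
# encode the unicode string to UTF-8 bytes before writing
f.write( u"some data\n".encode( 'utf-8' ) )
f.seek( 0 )
raw_data = f.readline()
# decode the bytes read back into a unicode string
unicode_string = raw_data.decode( 'utf-8' )
print repr( unicode_string )
f.close()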

This example isn't very interesting, though, because UTF-8 is designed
so that US-ASCII is a proper subset of it.  IOW, for characters
0x00-0x7f, US-ASCII and UTF-8 are identical (and so is ISO-8859-*).
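
A quick way to check that claim is to encode the same ASCII-only text
several ways and compare the bytes.  This is only a sketch: it assumes
the standard codec names 'ascii', 'utf-8', 'iso-8859-1' and
'utf-16-be', and the variable names are made up for illustration.  It
is written so that it also runs on newer Pythons.

```python
# Encode an ASCII-only string with several codecs and compare the bytes.
text = u"some data\n"

as_ascii = text.encode("ascii")
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("iso-8859-1")
as_utf16 = text.encode("utf-16-be")   # big-endian, no byte order mark

# For characters 0x00-0x7f the first three encodings give identical bytes.
same_as_ascii = (as_ascii == as_utf8 == as_latin1)

# UTF-16 uses two bytes per character here, so it is twice as long.
print(same_as_ascii)
print(len(as_utf8))
print(len(as_utf16))
```

UTF-16 is included only for contrast: it is not a superset of
US-ASCII at the byte level, which is one reason UTF-8 is convenient.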

To make this more interesting use some data that lies outside the
US-ASCII range.  If you try to print the string, though, you'll likely
get an error such as :

>>> print u"\u20ac"

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
>>>

The reason for this is python doesn't know how I want to encode
(serialize) the characters for output to my display.  I can print the
repr() of the string, though, since that only uses ASCII characters.

>>> print repr( u"\u20ac" )
u'\u20ac'
>>>
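
The way around the error is to pick the encoding yourself before the
characters leave the program.  Here is a minimal sketch, assuming
UTF-8 is what you want in the file; the scratch file from the tempfile
module and the variable names are just for illustration, and it is
written so that it also runs on newer Pythons.

```python
import os
import tempfile

euro = u"\u20ac"   # EURO SIGN, outside the US-ASCII range

# The same character becomes different byte sequences per encoding.
utf8_bytes = euro.encode("utf-8")        # 3 bytes: e2 82 ac
utf16_bytes = euro.encode("utf-16-be")   # 2 bytes: 20 ac

# Write the UTF-8 bytes to a scratch file in binary mode.
fd, path = tempfile.mkstemp()
os.close(fd)
f = open(path, "wb")
f.write(utf8_bytes)
f.close()

# Read the raw bytes back and remove the scratch file.
f = open(path, "rb")
raw_data = f.read()
f.close()
os.remove(path)

# Decoding with the encoding used for writing recovers the character.
round_tripped = raw_data.decode("utf-8")
print(repr(raw_data))
print(round_tripped == euro)
```

Opening the file in binary mode keeps the platform from touching the
bytes, so what lands in the file is exactly what encode() produced.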

HTH,
-D

-- 

If Microsoft would build a car...
... Occasionally your car would die on the freeway for no reason. You
would have to pull over to the side of the road, close all of the car
windows, shut it off, restart it, and reopen the windows before you
could continue. For some reason you would simply accept this.

GnuPG key : http://dman.ddts.net/~dman/public_key.gpg

