A Unicode problem -HELP

Ben Finney bignose+hates-spam at benfinney.id.au
Wed May 17 08:20:14 CEST 2006


"manstey" <manstey at csu.edu.au> writes:

> 1. Here is my input data file, line 2:
> gn1:1,1.2 R")$I73YT R")$IYT at ncfsa

Your program is reading this using the 'utf-8' encoding. When it does
so, all the characters you show above will be read in happily as you
see them (so long as you view them with the 'utf-8' encoding), and
converted to Unicode characters representing the same thing.

Do you have any other information that might indicate this is *not*
utf-8 encoded data?
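One rough way to gather that information is to try a strict decode of the raw bytes. The sketch below is only an illustration: the helper name looks_like_utf8 is made up here, and the sample byte strings are not taken from your file (apart from the two bytes '\xc9\x94' that appear in your output). A clean decode does not prove the data is utf-8, since plain ASCII decodes cleanly under many encodings, but a failure is good evidence that it is not.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # error handling is 'strict' by default
    except UnicodeDecodeError:
        return False
    return True


# Plain ASCII is valid UTF-8 (and valid in many other encodings too).
print(looks_like_utf8(b"gn1:1,1.2"))

# The two bytes from the end of the quoted output form one valid UTF-8 sequence.
print(looks_like_utf8(b"\xc9\x94"))

# Illustrative Latin-1 bytes: 0xE9 followed by an ASCII letter is not valid UTF-8.
print(looks_like_utf8(b"\xe9t\xe9"))
```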

> 2. Here is my output data file, line 2:
> u'gn', u'1', u'1', u'1', u'2', u'-', u'R")$I73YT', u'R")$IYT',
> u'R")$IYT', u'@', u'ncfsa', u'nc', '', '', '', u'f', u's', u'a', '',
> '', '', '', '', '', '', '', u'B.:R")$I^YT', u'b.:cv)cv^yc', '\xc9\x94'

As you can see, when the file is read with the 'utf-8' encoding and
written out again with the 'utf-8' encoding, the characters (as you
posted them in the message) are faithfully preserved by the Unicode
processing and encoding.


Bear in mind that when you present the "input data file, line 2" to
us, your message is itself encoded using a particular character
encoding. (In the case of the message where you wrote the above, it's
'utf-8'.) This means we may or may not be seeing the exact same bytes
you see in the input file; we're seeing characters in the encoding you
used to post the message.
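To make the bytes-versus-characters point concrete, here is a small sketch using the two bytes '\xc9\x94' from the end of your output line. The choice of 'latin-1' as the "other" encoding is purely an assumption for illustration; nothing in your message says the file uses it.

```python
# The same two bytes, interpreted under two different encodings.
raw = b"\xc9\x94"

as_utf8 = raw.decode("utf-8")      # one character: U+0254
as_latin1 = raw.decode("latin-1")  # two characters: U+00C9 and U+0094

print(len(as_utf8), ascii(as_utf8))
print(len(as_latin1), ascii(as_latin1))

# Encoding the same character back gives different bytes per encoding.
print("\u00e9".encode("utf-8"))    # two bytes
print("\u00e9".encode("latin-1"))  # one byte
```

The bytes in the file never change; only the characters you get out of them depend on which encoding you tell the decoder to use.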

You need to know what encoding was used when the data in that file was
written. You can then read the file using that encoding, which decodes
its bytes into Unicode characters for processing inside your program.
When you write them out again, you can choose the 'utf-8' encoding as
you have done.
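A minimal sketch of that read-then-write round trip follows, written for Python 3. The function name transcode_to_utf8, the file names, and the 'latin-1' source encoding are all assumptions made up for the example; substitute whatever encoding your file was really written in.

```python
import os
import tempfile


def transcode_to_utf8(src_path, dst_path, src_encoding):
    """Read src_path as src_encoding, write the same text to dst_path as UTF-8."""
    # Decoding happens here: bytes on disk become Unicode text in memory.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    # Encoding happens here: Unicode text becomes UTF-8 bytes on disk.
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
    return text


# Demonstration with a throwaway file (assumed to be Latin-1 for this example).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.txt")
    dst = os.path.join(tmp, "output.txt")
    with open(src, "wb") as f:
        f.write(b"caf\xe9\n")  # 0xE9 is e-acute in Latin-1

    text = transcode_to_utf8(src, dst, "latin-1")

    with open(dst, "rb") as f:
        out_bytes = f.read()

print(ascii(text))
print(out_bytes)
```

If the source encoding you pass in is wrong, the program may still run without error but produce the wrong characters, which is why finding out the real encoding comes first.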

Have you read this excellent article on understanding the programming
implications of character sets and Unicode?

    "The Absolute Minimum Every Software Developer Absolutely,
    Positively Must Know About Unicode and Character Sets (No
    Excuses!)"
    <URL:http://www.joelonsoftware.com/articles/Unicode.html>

-- 
 \     "I'd like to see a nude opera, because when they hit those high |
  `\   notes, I bet you can really see it in those genitals."  -- Jack |
_o__)                                                           Handey |
Ben Finney



