How to convert unicode

Terry Hancock hancock at anansispaceworks.com
Thu Sep 26 15:08:28 EDT 2002


Hi Ladislav,
I think you're going to have to be a little more specific about
what exactly you want before people can help...

Ladislav:
> I have a script that downloads a web page with characters that are not plain ASCII. From the
> web page header I can find out that it uses UTF-8 encoding (Unicode). I can display the page
> properly in a web browser if I send from my script
> "Content-Type: text/html; charset=UTF-8"

Right, you're telling your browser what kind of data it's receiving,
and it knows how to display it.

> But I need to save the page (the text of the page) to a file and some time later import the text
> to a database (MS Access or similar).
> How can I download the file or write to a file so that the characters will be properly displayed in
> the file and in the database as well?
> Thanks for help

This is actually sort of meaningless, since databases and
files don't "display" data, they just "store" it. In general
they don't care whether that data is 7, 8, or 16 bits wide,
as long as it's provided as a nice stream of bytes. (There
are exceptions -- I think some databases might trim off the
high bit, or disallow "control characters" or nulls (0x00) in
"text" fields, but even those usually provide a "binary" data
type that won't mangle your data. I can't address what
MS Access does because I've never used it.)
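
To make the "stream of bytes" point concrete, here is a minimal sketch
in present-day Python 3 (not what was available when this was written).
The helper names, the file name, and the sample text are all made up
for illustration; the sample stands in for whatever your script
downloaded. Writing in binary mode stores the bytes untouched, and
nothing gets decoded until you ask for it:

```python
import os
import tempfile

def save_raw(data: bytes, path: str) -> None:
    """Write the bytes exactly as received: no decoding, no translation."""
    with open(path, "wb") as f:
        f.write(data)

def load_raw(path: str) -> bytes:
    """Read the bytes back exactly as they were stored."""
    with open(path, "rb") as f:
        return f.read()

# Hypothetical stand-in for the downloaded page body, UTF-8 encoded.
page_bytes = "žluťoučký kůň".encode("utf-8")

path = os.path.join(tempfile.mkdtemp(), "page.html")
save_raw(page_bytes, path)
restored = load_raw(path)

# The file holds the same bytes; interpreting them as text is a separate,
# later step that needs to know the encoding (here taken from the header).
text = restored.decode("utf-8")
```

The file itself never needs to know that the bytes are UTF-8; only the
program that eventually reads them as text does.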

The question therefore is what you want to use to *view*
the data in the file or database, and you need to specify
that.  If you want to view it with "basically everything", then
you'll need to sacrifice portability and capability by converting
to some localized encoding (the most common being ASCII, but
that may not be the most appropriate for where you are).
If that's okay, then this is a character-translation problem,
and the result will *lose data*, which you need to be sure is okay.
(This is because many applications still only support ASCII or
ASCII-based 8-bit encodings, despite much progress.  If nothing else,
lots of people are still using older software.)
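
Here is a rough sketch of what that data loss looks like, again in
present-day Python 3. The sample text is my own assumption (a Czech
phrase, written with escapes so the exact code points are unambiguous),
and ISO-8859-2 is just one example of a localized 8-bit encoding; yours
may well be a different one:

```python
# Sample text "žluťoučký kůň", assumed purely for illustration.
text = "\u017elu\u0165ou\u010dk\u00fd k\u016f\u0148"

# A strict conversion to ASCII refuses to drop anything silently.
try:
    text.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# errors="replace" substitutes "?" for every character ASCII cannot
# represent; the original characters cannot be recovered afterwards.
lossy = text.encode("ascii", errors="replace")

# A localized 8-bit encoding keeps the characters it was designed for
# (ISO-8859-2 covers these), but nothing outside its repertoire.
latin2 = text.encode("iso-8859-2")
```

The strict call fails, `lossy` comes out as `b"?lu?ou?k? k??"`, and
`latin2` round-trips only because every character in the sample happens
to exist in ISO-8859-2.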

I'm guessing you want to preserve all the data though, in which
case you're not talking about translating the data in the file
or database at all. You instead need to be asking what software
you can use (or how to configure your existing software, if it
can already do it) to correctly display UTF-8 encoded Unicode.
You sacrifice some flexibility on the display side, but there is
a lot of Unicode-aware software out there already.

If for some reason the software supports one of the other Unicode
encodings, but not UTF-8, then you'll need to do a Unicode-to-Unicode
translation, which won't lose data, but only change its
representation. UTF-8 is widely supported, though, so I think this
isn't that likely.
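
If it does come to that, a Unicode-to-Unicode translation might look
something like this sketch (present-day Python 3; the `reencode` helper
and the choice of UTF-16 as the target are assumptions for illustration,
not something you've told us you need):

```python
def reencode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from one encoding and re-encode them in another."""
    return data.decode(source).encode(target)

# Hypothetical sample standing in for the saved page text.
utf8_bytes = "žluťoučký kůň".encode("utf-8")
utf16_bytes = reencode(utf8_bytes, "utf-8", "utf-16")

# The bytes differ, but both decode to exactly the same text.
same_text = utf16_bytes.decode("utf-16") == utf8_bytes.decode("utf-8")

# Going back the other way recovers the original bytes, so nothing is lost.
round_trip = reencode(utf16_bytes, "utf-16", "utf-8")
```

Both encodings can represent every Unicode character, which is why the
round trip is exact, unlike the conversion to ASCII.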

As you can see, we'd have to start making a lot of guesses to
be more helpful than that (and I'd probably guess wrong!), so
I'll stop and give you a chance to respond with what you
want/need to use for display. :-D

Cheers,
Terry

-- 
------------------------------------------------------
Terry Hancock
hancock at anansispaceworks.com       
Anansi Spaceworks                 
http://www.anansispaceworks.com 
P.O. Box 60583                     
Pasadena, CA 91116-6583
------------------------------------------------------



