Unicode from Web to MySQL

Francis Avila francisgavila at yahoo.com
Sat Dec 20 22:44:18 EST 2003


Bill Eldridge wrote in message ...
>Skip Montanaro wrote:
>Encoding for example is a UTF-8 page Vietnamese,
>try:
>
> http://www.rfa.org/service/index.html?service=vie
>or
> http://www.rfa.org/service/article.html?service=vie&encoding=9&id=123655
>
>I've tried grabbing this, doing vietstring.decode(None,'strict')
>gives an error (wants a string, not None), doing
>unicode(data,'unicode','replace') fails,
>unicode(data,'raw-unicode-escape','replace') somewhat works,
>I can then try
>unicode(data,'raw-unicode-escape','replace').encode('utf-8')
>but I get a SQL error at that point.

You still have not understood the crucial lesson: Unicode is *not* *an*
*encoding*.  Not an encoding!

Immediate logical ramification: UTF8 (or whatever other encoding you wish to
name) IS NOT UNICODE!

Let's look at each of your attempts and see why each makes no sense:

>>> vietstring.decode(None, 'strict')
How can a str be decoded from nothing? If it has no encoding, it's just raw
bytes with no meaningful interpretation of those bytes.  And now you want a
unicode object to be magically produced?

What you should say is, "Ok, I know that vietstring is utf8 encoded, so to
decode it (to a unicode object), I guess I'll have to tell Python
vietstring.decode('utf8'), meaning 'Decode vietstring from utf8.'"

>>> import urllib
>>> viet = urllib.urlopen('http://www.rfa.org/service/index.html?service=vie')
>>> vietstr = viet.read()
>>> type(vietstr) # Raw bits; no intrinsic meaning
<type 'str'>
>>> vietunicode = vietstr.decode('utf8')
>>> type(vietunicode) # Raw intrinsic meaning; no bits.
<type 'unicode'>
>>>

unicode and str are diametrically opposed views of reality.  Unicode is the
rationalist--there's no reality outside of meaning (i.e., no bits).  Str is
the empiricist--there's only raw bits, and the only meaning is what you give
them.
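
For anyone who wants to try the same split without hitting the network, here
is a minimal sketch.  It assumes a Python 3 interpreter, where the names have
moved: the byte type is called bytes (the str of the session above) and the
text type is called str (the unicode of the session above).  The sample text
is a stand-in I made up, not content from the rfa.org page.

```python
# Python 3 naming: bytes plays the role of the old str,
# and str plays the role of the old unicode.
# sample_text is a hypothetical stand-in for the page contents.
sample_text = "Ti\u1ebfng Vi\u1ec7t"

raw = sample_text.encode("utf-8")   # the bytes a UTF-8 page would deliver
text = raw.decode("utf-8")          # the abstract characters behind them

print(type(raw).__name__, len(raw))    # bytes 14
print(type(text).__name__, len(text))  # str 10
```

The two lengths differ because each of the two accented letters takes three
bytes in utf8 but is still one character.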

>>> unicode(data, 'unicode', 'replace')
You want a unicode object to be produced from data, which you declare as
being in the 'unicode' encoding.  But there's no such encoding!  Unicode is
*not* an encoding!  Unicode is more abstract than bytes.  Do not ever think
of bytes and unicode in the same thought.
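
You can ask Python directly whether a codec of a given name is registered.
The sketch below assumes Python 3 and uses a small helper, codec_exists, which
is a name I made up for illustration; codecs.lookup itself is the standard
library call.

```python
import codecs


def codec_exists(name):
    """Return True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        # codecs.lookup raises LookupError for unknown encoding names.
        return False


for name in ("unicode", "utf-8", "raw-unicode-escape"):
    print(name, codec_exists(name))
```

On the interpreters I would expect this to run on, 'unicode' comes back as
not registered while the other two are found.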

>>> unicode(data, 'raw-unicode-escape', 'replace')

This may seem to work, but really it's exactly the same as ur'<contents of
data>'--it's treating data as though it were a raw unicode literal:

>>> s = '\\u1234'
>>> len(s)
6
>>> us = unicode(s, 'raw-unicode-escape')
>>> us
u'\u1234'
>>> len(us)
1

This is not what you want! So
unicode(data,'raw-unicode-escape','replace').encode('utf-8') is the
utf8-encoded str of what you didn't want in the first place!
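
To see the damage concretely, here is a small sketch, again assuming Python 3
(bytes for the raw data, str for the unicode object).  The single character
used is a made-up sample, not data from the page in question.

```python
# One Vietnamese letter, U+1EC7, as the three bytes a utf8 page would send.
raw = "\u1ec7".encode("utf-8")

# Wrong: raw-unicode-escape reads each plain byte as one Latin-1 character.
wrong = raw.decode("raw_unicode_escape")

# Right: utf8 reads the three bytes as the single character they encode.
right = raw.decode("utf-8")

# Encoding the wrong result to utf8 does not undo the mistake;
# it yields different (and longer) bytes than the ones you started with.
double_encoded = wrong.encode("utf-8")

print(len(raw), len(wrong), len(right), len(double_encoded))  # 3 3 1 6
```

Three bytes went in, and the wrong path hands the database six bytes that no
longer spell the original letter.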

vietstring.decode('utf8') will give you what you want, namely, a unicode
object.  Before you feed the unicode object to SQL, encode it to utf8 (a str
object).   This part you seem to understand just fine, but you have some
sort of mental block against recognizing that you need to decode the string
you got from the web before you can get a unicode object!

In this particular case (where it's already utf8), you can put vietstring
straight into the SQL database as you found it, without doing any conversion
at all.  But this is only because the raw bits are the same as they would have
been if you had decoded to pure unicode and then encoded to utf8.
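
A quick way to convince yourself of that, sketched under the assumption of
Python 3 and with made-up sample bytes; utf8_round_trip is a hypothetical
helper name, not a library function:

```python
def utf8_round_trip(raw):
    """Decode bytes as utf8, then encode the result back to utf8."""
    return raw.decode("utf-8").encode("utf-8")


# Valid utf8 comes back bit-for-bit identical.
good = "Ti\u1ebfng Vi\u1ec7t".encode("utf-8")
print(utf8_round_trip(good) == good)  # True

# Bytes that are not utf8 fail loudly at the decode step instead.
bad = b"\xff\xfe"
try:
    utf8_round_trip(bad)
    rejected = False
except UnicodeDecodeError:
    rejected = True
print(rejected)  # True
```

So the decode step buys you nothing for data that really is utf8, except the
check that it really is utf8.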

To make sure that all your problems are with Python unicode<->str conversion
confusion, and NOT with SQL, try placing vietstring straight into SQL
without touching it.
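
As a rough illustration of "without touching it", the sketch below stores
untouched bytes and reads them back.  Big caveats: it assumes Python 3, and it
uses the standard library's sqlite3 with an in-memory database purely as a
stand-in, since I can't assume a MySQL server or driver here.  The table and
column names are invented, and MySQL column character sets and connection
settings are not modelled at all.

```python
import sqlite3

# Hypothetical sample standing in for the bytes read from the web page.
raw = "Ti\u1ebfng Vi\u1ec7t".encode("utf-8")

conn = sqlite3.connect(":memory:")  # stand-in for the real MySQL database
conn.execute("CREATE TABLE pages (body BLOB)")

# A parameterized query hands the bytes to the driver as data,
# so they are never spliced into the SQL text itself.
conn.execute("INSERT INTO pages (body) VALUES (?)", (raw,))

stored = conn.execute("SELECT body FROM pages").fetchone()[0]
conn.close()

print(stored == raw)  # True
```

If the equivalent test against your real database gives back different bytes,
the problem is on the SQL side and not in your str/unicode handling.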
--
Francis Avila






