[Tutor] clob and string conversion

Sun Jun 26 07:41:08 CEST 2005

On Sat, 25 Jun 2005, Ming Xue wrote:

> I am trying to access oracle9i with cx_Oracle package. One of the column
> is clob and when I tried clob.read() I got non-text output. I know the
> real data is text. Can somebody show me how to convert clob to string in
> python?

Hi Ming,

This seems really specialized to the cx_Oracle database module; you may
want to ask the folks at the Database Special Interest Group (DB-SIG):

    http://www.python.org/sigs/db-sig/

because the folks there are probably more aware of the issues one needs
to think about with CLOBs: we at Tutor might not necessarily have the
specialized experience needed to give you the best help.

I've been looking at the definition of the LOB object interface,

    http://starship.python.net/crew/atuining/cx_Oracle/html/lobobj.html

and it does seem like clob.read() should do the trick.  Without seeing
what you're doing, it's very difficult to know what's going on.  Do you
mind showing us an example of the text that you're getting back from the
CLOB, and what you're expecting to see?

You do mention that you're getting something back, but it doesn't look
like text to you... Ok, my best guess so far is that you may be seeing a
text encoding that you're not expecting.  I'll go with that guess until we
know more information.  *grin* If you can show us that example, we'll have
a better basis for testing the hypothesis.

Do you know what text encoding the strings in your database are in?  Are
they in Latin-1, or perhaps in a Unicode format such as UTF-8 or UTF-16?

The reason I ask is because it's not at all obvious from looking at bytes
alone how to interpret them, so it's possible that you may need to
"decode" those bytes by telling Python explicitly what encoding to use.
CLOBs are binary streams, so Python will never impose an interpretation on
those bytes until we tell it to.

For example, if one were to give us the byte string:

######
secret = '\xfe\xff\x00h\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d'
######

then we'd probably be quite unhappy until we also knew that those bytes
represented a string in utf-16 encoding:

######
>>> secret.decode('utf-16')
u'hello world'
######
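
If you don't know the encoding up front, one way to explore is to try a
few likely candidates and look at what comes back.  Here is a minimal
sketch of that idea, written for Python 3 (where byte strings need the
b'' prefix).  The helper name try_encodings and the list of candidate
encodings are my own assumptions for illustration; they are not part of
cx_Oracle or of anything you've shown us so far.

```python
# Same bytes as the 'secret' example above, as a Python 3 bytes literal.
secret = b'\xfe\xff\x00h\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d'

def try_encodings(raw, candidates=('utf-8', 'utf-16', 'latin-1')):
    """Decode raw bytes with each candidate encoding (hypothetical helper).

    Returns a dict mapping encoding name to the decoded text, or to None
    if the bytes are not valid in that encoding.
    """
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

results = try_encodings(secret)
for name, text in results.items():
    # repr() makes stray control characters visible instead of hiding them.
    print(name, '->', repr(text))
```

Keep in mind that a successful decode doesn't prove the guess is right:
latin-1 accepts any sequence of bytes, so it will never raise an error
even when the result is gibberish.  You still have to eyeball the output
and see which one looks like the text you expect.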

For a more general introduction to unicode encodings, you may want to look
at Joel Spolsky's amusing article on "The Absolute Minimum Every Software
Developer Absolutely, Positively Must Know About Unicode and Character
Sets (No Excuses!)":

    http://www.joelonsoftware.com/articles/Unicode.html

Python's supported encodings are listed here:

    http://www.python.org/doc/lib/standard-encodings.html

Best of wishes to you!