[Tutor] ascii to/from AL32UTF8 conversion

Steven D'Aprano steve at pearwood.info
Mon Nov 23 05:37:00 EST 2015


On Sun, Nov 22, 2015 at 11:19:17PM -0500, bruce wrote:
> Hi.
> 
> Doing a 'simple' test with linux command line curl, as well as pycurl
> to fetch a page from a server.
> 
> The page has a charset of  >>AL32UTF8.

I had never heard of that before, so I googled for it. No surprise, it 
comes from Oracle, and they have made a complete dog's breakfast out of 
it.

According to the answers here:

https://community.oracle.com/thread/3514820

(1) Oracle thinks that UTF-8 is a two-byte encoding (it isn't);

(2) AL32UTF8 has "extra characters" that UTF-8 doesn't, but UTF-8 is a 
superset of AL32UTF8 (that's a contradiction!);

(3) Oracle's UTF-8 is actually the abomination more properly known as 
CESU-8: http://www.unicode.org/reports/tr26/

(4) Oracle's AL32UTF8 might actually be the real UTF-8, not "Oracle 
UTF-8", which is rubbish.
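
To make point (3) concrete, here is a rough sketch of how CESU-8 differs 
from real UTF-8. Python has no built-in CESU-8 codec, so the helper 
cesu8_encode below is a name I made up for illustration (Python 3 syntax). 
It encodes each UTF-16 code unit separately, which is how TR26 describes 
CESU-8. The two encodings only disagree for characters outside the Basic 
Multilingual Plane:

```python
def cesu8_encode(text):
    """Hypothetical helper: encode text as CESU-8 (see Unicode TR26)."""
    # Go via UTF-16 so that astral characters become surrogate pairs.
    utf16 = text.encode('utf-16-be')
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = (utf16[i] << 8) | utf16[i + 1]
        # 'surrogatepass' lets each surrogate through as a 3-byte sequence.
        out += chr(unit).encode('utf-8', 'surrogatepass')
    return bytes(out)

bmp = u"abc \u03c0 def"        # everything inside the BMP
astral = u"\U0001F600"         # a character outside the BMP

print(cesu8_encode(bmp) == bmp.encode('utf-8'))   # True: identical bytes
print(len(astral.encode('utf-8')))                # 4 bytes in UTF-8
print(len(cesu8_encode(astral)))                  # 6 bytes in CESU-8
```

So for plain ASCII or BMP-only pages, it makes no difference which of the 
two the server is really sending.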


> Any way to convert this to straight ascii. Python is throwing a
> notice/error on the charset in another part of the test..
> 
> The target site is US based, so there's no weird chars in it..

I wouldn't be so sure about that.


> I suspect that the page/system is based on legacy oracle
> 
> The metadata of the page is
> 
> <META HTTP-EQUIV="Content-Type" NAME="META" CONTENT="text/html;
> charset=AL32UTF8">
> 
> I tried the usual
> 
> foo = foo.decode('utf-8')

And what happened? Did you get an error? Please copy and paste the 
complete traceback.

The easy way to hit this problem with a hammer and "fix it" is to do 
this:

foo = foo.decode('utf-8', errors='replace')

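but that will replace any malformed UTF-8 bytes with the Unicode 
replacement character U+FFFD, which often displays as a question mark. 
Decode as ASCII instead and every non-ASCII byte gets the same treatment: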

py> s = u"abc π def".encode('utf-8')  # non-ASCII string
py> print s.decode('ascii', errors='replace')
abc �� def

which loses data. That should normally be considered a last resort.
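
A gentler approach is to try a strict decode first and only fall back to 
errors='replace' if that fails, so you at least find out where the bad 
bytes are. A minimal sketch (Python 3 syntax): the function name 
decode_page and the alias table are my own invention, and treating 
AL32UTF8 as plain UTF-8 is an assumption based on point (4) above, not 
something I have confirmed:

```python
# Assumption: Oracle's AL32UTF8 label means ordinary UTF-8.
CHARSET_ALIASES = {'al32utf8': 'utf-8'}

def decode_page(raw, charset):
    """Hypothetical helper: return (text, error); error is None on success."""
    codec = CHARSET_ALIASES.get(charset.lower(), charset)
    try:
        return raw.decode(codec), None
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that failed to decode.
        return raw.decode(codec, errors='replace'), err

good = u"abc \u03c0 def".encode('utf-8')
text, err = decode_page(good, 'AL32UTF8')
print(err is None)      # True: valid UTF-8, nothing replaced

bad = b"abc \xff def"   # 0xFF can never appear in valid UTF-8
text, err = decode_page(bad, 'AL32UTF8')
print(err.start)        # 4: offset of the offending byte
```

That way the lossy decode only happens when it has to, and the error 
object tells you which part of the page to look at.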

It might also help to open the downloaded file in a hex editor and see 
if it looks like binary or text. If you see lots of zeroes, e.g.:

...006100340042005600...

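then the encoding is probably not UTF-8. It is more likely UTF-16 (or 
possibly UTF-32), where ASCII characters are padded out with zero bytes.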
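
If you don't have a hex editor handy, a few lines of Python can do the 
same check. This is only a rough heuristic under my own assumptions: the 
function name looks_like_utf16 and the 25% threshold are invented for 
illustration, while the BOM constants come from the standard codecs 
module (Python 3 syntax):

```python
import codecs

def looks_like_utf16(raw, threshold=0.25):
    """Hypothetical heuristic: guess whether raw bytes are UTF-16."""
    # A byte order mark at the start is a strong hint.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return True
    if not raw:
        return False
    # Mostly-ASCII text in UTF-16 has a zero byte for nearly every character.
    return raw.count(b'\x00') / len(raw) >= threshold

sample = bytes.fromhex('0061003400420056')   # start of the hex dump above
print(looks_like_utf16(sample))              # True
print(sample.decode('utf-16-be'))            # a4BV
print(looks_like_utf16(u"abc \u03c0 def".encode('utf-8')))   # False
```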


-- 
Steve

