[Tutor] ascii to/from AL32UTF8 conversion
Steven D'Aprano
steve at pearwood.info
Mon Nov 23 05:37:00 EST 2015
On Sun, Nov 22, 2015 at 11:19:17PM -0500, bruce wrote:
> Hi.
>
> Doing a 'simple' test with linux command line curl, as well as pycurl
> to fetch a page from a server.
>
> The page has a charset of >>AL32UTF8.
I had never heard of that before, so I googled for it. No surprise, it
comes from Oracle, and they have made a complete dog's breakfast out of
it.
According to the answers here:
https://community.oracle.com/thread/3514820
(1) Oracle thinks that UTF-8 is a two-byte encoding (it isn't);
(2) AL32UTF8 has "extra characters" that UTF-8 doesn't, but UTF-8 is a
superset of AL32UTF8 (that's a contradiction!);
(3) Oracle's UTF-8 is actually the abomination more properly known as
CESU-8: http://www.unicode.org/reports/tr26/
(4) Oracle's AL32UTF8 might actually be the real UTF-8, not "Oracle
UTF-8", which is rubbish.
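To make the difference in point (3) concrete, here is a minimal sketch comparing real UTF-8 with CESU-8 for a character outside the Basic Multilingual Plane. It assumes Python 3, and `cesu8_encode` is a hypothetical helper written for illustration, not something from the standard library or from Oracle.

```python
def cesu8_encode(text):
    """Encode text as CESU-8 (illustrative helper, not a stdlib codec).

    CESU-8 encodes each UTF-16 code unit separately, so a character
    outside the BMP becomes two 3-byte sequences (6 bytes) instead of
    the single 4-byte sequence that real UTF-8 uses.
    """
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = (utf16[i] << 8) | utf16[i + 1]
        # "surrogatepass" lets lone surrogates through the UTF-8 encoder.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)

smiley = "\U0001F600"                 # a character outside the BMP
real_utf8 = smiley.encode("utf-8")    # 4 bytes
cesu8 = cesu8_encode(smiley)          # 6 bytes

# A strict UTF-8 decoder refuses the CESU-8 form.
try:
    cesu8.decode("utf-8")
    strict_decoder_accepts_cesu8 = True
except UnicodeDecodeError:
    strict_decoder_accepts_cesu8 = False

print(real_utf8, cesu8, strict_decoder_accepts_cesu8)
```

For text that stays inside the BMP (which includes all of ASCII and π), the two encodings produce identical bytes, so the difference only shows up with supplementary characters such as emoji.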
> Anyway to conert this to straight ascii. Python is throwing a
> notice/error on the charset in another part of the test..
>
> The target site is US based, so there's no weird chars in it..
I wouldn't be so sure about that.
> I suspect that the page/system is based on legacy oracle
>
> The metadata of the page is
>
> <META HTTP-EQUIV="Content-Type" NAME="META" CONTENT="text/html;
> charset=AL32UTF8">
>
> I tried the usual
>
> foo = foo.decode('utf-8')
And what happened? Did you get an error? Please copy and paste the
complete traceback.
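The kind of exception matters here. A LookupError means Python does not recognise the charset label at all, while a UnicodeDecodeError means the bytes themselves are not valid in that encoding. The sketch below separates the two cases. It assumes Python 3 and assumes that AL32UTF8 really is byte-compatible with standard UTF-8; the alias table and both function names are hypothetical.

```python
import codecs

# Assumption: Oracle's AL32UTF8 label can be treated as plain UTF-8.
HYPOTHETICAL_ALIASES = {"al32utf8": "utf-8"}

def resolve_charset(label):
    """Map a charset label from a page to a codec name Python knows."""
    key = label.strip().lower()
    try:
        return codecs.lookup(key).name
    except LookupError:
        if key not in HYPOTHETICAL_ALIASES:
            raise
        return HYPOTHETICAL_ALIASES[key]

def describe_decode_failure(raw, encoding="utf-8"):
    """Return None if raw decodes cleanly, else a short description
    of the first bad byte and a few bytes of context around it."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        context = raw[max(0, exc.start - 8):exc.end + 8]
        return "%s at byte %d: %r" % (exc.reason, exc.start, context)
    return None

codec = resolve_charset("AL32UTF8")
print(codec)
print(describe_decode_failure(b"abc \xff def", codec))
```

Knowing the offset of the first bad byte makes it much easier to look at the raw data and work out what encoding it really is.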
The easy way to hit this problem with a hammer and "fix it" is to do
this:
foo = foo.decode('utf-8', errors='replace')
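but that will replace any malformed UTF-8 bytes with the Unicode
replacement character U+FFFD, which is often displayed as a question
mark. Decoding as ASCII instead does the same to every non-ASCII byte: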
py> s = u"abc π def".encode('utf-8') # non-ASCII string
py> print s.decode('ascii', errors='replace')
abc �� def
which loses data. That should normally be considered a last resort.
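One way to keep `errors='replace'` as a genuine last resort is to try a strict decode first and only fall back when it fails, counting how many characters were lost. This is a minimal sketch assuming Python 3; `decode_lossy` is a hypothetical name.

```python
def decode_lossy(raw, encoding="utf-8"):
    """Return (text, n_replaced).

    Tries a strict decode first. Only if that fails does it fall back
    to errors="replace" and count the U+FFFD characters it produced.
    """
    try:
        return raw.decode(encoding), 0
    except UnicodeDecodeError:
        text = raw.decode(encoding, errors="replace")
        # Caveat: this also counts any U+FFFD that was already
        # legitimately present in the data.
        return text, text.count("\ufffd")

good = "abc \u03c0 def".encode("utf-8")
print(decode_lossy(good))            # decodes cleanly, nothing replaced
print(decode_lossy(b"abc \xff def")) # one malformed byte replaced
print(decode_lossy(good, "ascii"))   # both bytes of pi replaced
```

If the count is zero, nothing was thrown away; if it is large, the guessed encoding is probably wrong and the result should not be trusted.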
It might also help to open the downloaded file in a hex editor and see
if it looks like binary or text. If you see lots of zeroes, e.g.:
...006100340042005600...
then the encoding is probably not UTF-8. Alternating zero bytes like
that usually mean UTF-16.
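The same check can be done without a hex editor. The sketch below counts NUL bytes at even and odd offsets in the first part of the data and guesses a UTF-16 byte order from that. It assumes Python 3; the function name, sample size and threshold are arbitrary choices, and it makes no attempt to handle byte order marks or UTF-32.

```python
def guess_utf16(raw, sample_size=4096, threshold=0.3):
    """Guess "utf-16-be" or "utf-16-le" from the pattern of NUL bytes,
    or return None if the data does not look like UTF-16.

    For mostly-ASCII text, UTF-16 has a zero byte in every 2-byte
    unit: first for big-endian, second for little-endian.
    """
    sample = bytearray(raw[:sample_size])
    pairs = len(sample) // 2
    if pairs == 0:
        return None
    even_nuls = sum(1 for b in sample[0:2 * pairs:2] if b == 0)
    odd_nuls = sum(1 for b in sample[1:2 * pairs:2] if b == 0)
    if max(even_nuls, odd_nuls) < threshold * pairs:
        return None  # few zero bytes: plausibly UTF-8 or another 8-bit encoding
    return "utf-16-be" if even_nuls > odd_nuls else "utf-16-le"

# Hex digits like those in the example above.
raw = bytes.fromhex("0061003400420056")
encoding = guess_utf16(raw)
print(encoding, raw.decode(encoding))
```

This is only a heuristic for text that is mostly ASCII. Real binary data also contains plenty of zero bytes, so a positive guess still needs checking by eye.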
--
Steve