CGI Unicode issue?

Tim Roberts timr at probo.com
Mon Apr 2 00:19:05 EDT 2001


This post is off-topic for this newsgroup, but there is a fair amount of
CGI experience here, and I couldn't come up with a precisely appropriate
newsgroup.

I have a web site controlled entirely by Python scripts, using HTMLgen.py
and cgi.py as my primary tools.  The site is used by a large number of
ordinary users, and a couple of administrators with fuller access.

This week, one of the administrators started triggering assertion errors in
my scripts in response to a form method=POST.  As I dump the names and
values in my cgi.FieldStorage object, I see the following:

 id+AAA- = (<type 'instance'> FieldStorage('id+AAA-', None, 'p01682+AAA-')
 username+AAA- = (<type 'instance'> FieldStorage('username+AAA-', ...
 subject+AAA-- = (<type 'instance'> FieldStorage('subject+AAA--', ...

and so on.  The key problem here is the "+AAA-" strings, which should not
be there.  I look for the required "id" key in the cgi.FieldStorage object,
and since "id+AAA-" does not match, this triggers the assertion.  Later
members hold e-mail bodies; all special characters in those bodies have
been replaced by similar codes ($ => +ACU-,  & => +ACY-, etc.).

After a fair amount of web digging, I've learned that these codes are
base64-encoded 16-bit Unicode (UTF-16) characters.  AAA is the base64
encoding of a 16-bit null.  ACU is the base64 encoding of 0024, the 16-bit
Unicode dollar sign.  The RFC on UTF-7 encoding (RFC 2152) describes this
+XXX- encoding.

However, I don't understand why I see these.  There is nothing in the CGI
environment variables that indicates I should be expecting any kind of
unusual encoding.  There are no characters in the query string outside the
7-bit ASCII range.  This user happens to be running Internet Explorer 5.5,
and running the same query from IE 5.0 produces correct results.  My fear
is that this is an IE 5.5 bug, but I can find no mention of anything
similar in a web search.

Later keys in cgi.FieldStorage get even stranger; the +AAA- is followed by
leftover characters from earlier strings.  You can already see that in the
"subject" member above; the second "-" is leftover from the "username"
member.  This gets worse in later members.  The last member, "Button", is
stored as:

  Button+AAA-AAA-ffice Space Available+AAA-yyy = (<type 'instance'>...

For the time being, I have modified cgi.py to look for the +AAA- string in
the name part of the query string; if found, I do a re.search for all
strings matching "+A..-" and pass them through binascii.a2b_base64, then
terminate the string at the first +AAA-.  This seems to have worked around
the problem, but I'm hoping someone can tell me they've seen this before,
and that I'm missing something fundamental.
--
- Tim Roberts, timr at probo.com
  Providenza & Boekelheide, Inc.


