A challenge to the ASCII proponents.

Bengt Richter bokr at oz.net
Mon Jul 21 23:03:17 CEST 2003

On Mon, 21 Jul 2003 10:51:32 +0100, Alan Kennedy <alanmk at hotmail.com> wrote:

>I don't want to go on and on about this, and I'm happy to concede that
>some of my points are far from proven, and others are disproven.
>However, there are one or two small points I'd like to make.
Ditto ;-)

>Ben Finney wrote:
>>>> Which quickly leads to "You must use $BROWSER to view this site".
>>>> No thanks.
>Alan Kennedy wrote:
>>> No, that's the precise opposite of the point I was making.
>Ben Finney wrote:
>> You also stipulated "... from a Usenet post".  Most Usenet readers
>> do not handle markup, nor should they.  There are many benefits from
>> the fact that posts are plain text, readable by any software that can
>> handle character streams;
>1. While there may be benefits from posts being plain text, there are
>also costs. The cost is a "semantic disconnect", where related
>concepts are not searchable, linkable or matchable, because their
>character representations are not comparable.
don't want some awful garbage like the above line in postings?
Especially since the bulk of it would probably be automatically
generated by an MS NLP feature ;-/

>2. I chose the "from a usenet post" restriction precisely because of
>the 7-bit issue, because I knew that 8-bit character sets would break
>in some places. It was an obstacle course.
I see this as a separate issue from semantics, though. Encoding consistent
signs for identical things is a different problem from handling and encoding
the *meaning* of the things indicated.

>> parsing a markup tree for an article is a whole order
>> of complexity that I'd rather not have in my newsreader.
>> Expecting people to use a news reader that attempts to parse markup
>> and render the result, is like expecting people to use an email reader
>> that attempts to parse markup and render the result.  Don't.
>I don't expect people's newsreaders or email clients to start parsing
>embedded XML (I nearly barfed when I saw Microsoft's "XML Data
>Islands" for the first time).
>What I'm really concerned about is the cultural impact. I voluntarily
>maintain a web site for an organisation that has members in 26
>countries, who not surprisingly have lots of non-ASCII characters in
>their names. Here's one:
Ok, but, do we need to embed a full markup language to handle small
encoding exceptions, if that's the real concern? (IMO, no ;-)

>Because of the ASCII restriction in URLs, I was only able to offer Dr.
>Pavlík the above uri, or this:
>which sucks.

Which part, though? The encoding, or the fact that you see the encoding
in the above instead of its being rendered with the intended appearance?

IOW, any solution will involve *some* encoding and the possibility of
rendering it "raw" or interpreted. A smart GUI might have a default
mode showing everything interpreted, and have a "view source" button.

But "any solution" is not where we are. I think most people would object to
getting this kind of stuff in what appears to be just visually enhanced mail,
if they were aware every time it happened:
(I hope nobody's too-smart-for-its-own-good viewer tries to see the following
as actual MIME content ;-/)

MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="----000000000000000000000000000000000000000000000000000000000000000"

<A HREF="http://ad.doubleclick.net/jump/N2870.or/B914513.8;sz=1x1;ord=[timestamp]?">
<IMG SRC="http://ad.doubleclick.net/ad/N2870.or/B914513.8;sz=1x1;ord=[timestamp]?" BORDER=0 WIDTH=1 HEIGHT=1 ALT=""></A>
The trouble is that any automated following of references out of received information
is effectively a stealth channel for info, if only to inform that your computer processed
the message. Not to mention cookie stuff, and exploitation of real security holes.

The defense of filtering out requests to alternate server sources may be a reasonable
compromise for web viewing, but IMO such defenses should not be necessary in email.
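
To make the concern concrete, here is a minimal sketch of a filter that lists
the remote references in an HTML chunk without fetching any of them. The class
and function names are hypothetical, and checking only the src, href and
background attributes is an assumption; a real filter would need a longer list.

```python
from html.parser import HTMLParser


class RemoteRefFinder(HTMLParser):
    """Collect URLs a renderer might fetch or follow. Nothing is fetched here."""

    # Assumption: these attributes cover the common cases; the list is not exhaustive.
    URL_ATTRS = {"src", "href", "background"}

    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag and attribute names already lowercased.
        for name, value in attrs:
            if name in self.URL_ATTRS and value:
                if value.lower().startswith(("http://", "https://")):
                    self.refs.append((tag, name, value))


def remote_refs(html_text):
    """Return (tag, attribute, url) triples for every remote reference found."""
    finder = RemoteRefFinder()
    finder.feed(html_text)
    finder.close()
    return finder.refs
```

A reader could show that list and ask before contacting any server, which keeps
the control transfer manual.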

But the large majority of users will use MS stuff with whatever defaults MS decided
good for something. So an email winds up doing a lot more than presenting different
language encodings and fonts well.

That's why alternatives have appeared, but what happens if an html chunk is handed to
a MS system DLL to do rendering? How limited is the interpretation? Should it all be
rewritten from scratch and duplicate lots of stuff already available? What can safely
be handed off? A whole email preview pane presentation?

>Little wonder then that the next generation are choosing to explicitly
>remove the accents from their names, i.e. his colleague Dr. Machackova
>explicitly asked to have the accents in her name removed. Although I
>assured her that her name would be correctly spelled, on web sites
>that I maintain, the fact that her name breaks continually with
>various ASCII centric technologies makes her think it's not worth the
>hassle, or worth the risk of searches for her name failing.
That is a complaint about the current state of affairs, though.
The problem is migration to new tools without unacceptable backwards breakage.
Unfortunately, the lowest common denominator tends to be a breakage solution.

>And what about Dr. Sigurðardóttir, Dr. Djønne, and Dr. de la Cruz
>Domínguez Punaro? Are they destined to be passed over more often than
>ASCII-named people?
>[BTW, I've written the above in "windows-1252", apologies if it gets
>Solely because of technical inertia, and unwillingness to address the
>(perhaps excessive) complexity of our various communications layers,
>i.e. our own "Tower of 7-bit Babel", we're suppressing cultural
>diversity, for no technically valid reason.
>I personally don't have the slightest problem with reformulating NNTP
>and POP to use XML instead: In a way, I think it's almost inevitable,
>given how poor our existing "ascii" technologies are at dealing with
>i18n and l10n issues. Emails and usenet posts are all just documents
>after all.
Right, but are they multimedia presentations? I like the option to have
the latter, but only by optional and intentional linkage following, and
rendered by a separate invoked tool, not as an email reader built-in.

That might seem like a fine distinction, but it is easier to trust a limited
tool with well-defined and manual control transfers to other functionality
than it is to trust an automated do-it-all. That is the comfort of plain ascii, IMO,
and for many that comfort is worth a fair amount of other discomfort.

If the information for the other tool has to be embedded, we have MIME attachments,
but IMO they should not be delivered by default. ISTM having a selection of simple
separate tools that do limited things only on manual command would be better for email
than having it be an instance of general purpose XHTML processing.

>Would something like this really be so offensive (the Gaelic isn't, I
>promise :-)? Or inefficient?
><?xml version="1.0" encoding="windows-1252"?>
>  <subject>An mhaith l'éinne dul go dtí an nGaillimh Dé
>  <from>aláin ó cinnéide</from>
>  <to>na cailíní agus na buachaillí</to>

Hm, your post arrived to me with this in the header:
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit

ISTM the problem is the dual aspect of some meta-data (e.g. headers).
I.e., to recognize meta-data, something has to assume an encoding for *it*.
The simplest assumption is a fixed standard assumption, like ascii.
This dual-aspect problem appears also in file systems, where a box handle
also serves as a box label and loose content-type indicator.

If you want meta-data also to be GUI-presentation-encoded data you
are getting away from a standard assumption, or perhaps substituting
the utf-8 standard assumption of XML. If the latter, why not just let
that be it, without involving the tagged markup cruft of XML, unless
you have specific goals for the markup per se? I can see such goals,
but I don't think they belong in normal email, and to do it just to
identify header elements is IMO an unclean solution to what RFC 2822
already does much more readably (taking glyph encoding as a separate issue).

The problem with multiple encoding declarations is that they have to be
recognized as meta-data. They belong *on* the box, not in it. The only
way to be in a box is to be on a box inside a box that can contain others
in a way that separates the contained boxes, like multi-part MIME sections.

Otherwise a nested encoding declaration would have to be *itself* encoded
in whatever the current encoding was. But then you have to decide that it
wasn't just peculiar data, and you have to invent an escape, etc...
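
The boxes-inside-a-box arrangement can be sketched with Python's standard email
package: each part carries its own charset label, and the multipart container
keeps the parts apart. The function names and sample strings are made up for
illustration, and this is only one way to build such a message.

```python
from email.message import EmailMessage


def build_nested():
    """Build one outer box holding two inner boxes with different charsets."""
    outer = EmailMessage()
    outer["Subject"] = "boxes inside a box"
    # First inner box: Latin-1 text, labelled as such on its own header.
    outer.set_content("Dom\u00ednguez", charset="iso-8859-1")
    # Second inner box: Greek text, labelled utf-8. Adding it turns the
    # message into multipart/mixed with the first text as part one.
    outer.add_attachment("\u03b3\u03af\u03b3\u03bd\u03c9\u03c3\u03ba\u03c9", charset="utf-8")
    return outer


def describe(message):
    """Return (charset label, decoded text) for each inner box."""
    return [
        (part.get_content_charset(), part.get_content().strip())
        for part in message.iter_parts()
    ]
```

Each declaration sits on its own box, so nothing inside a body ever has to be
recognized as an encoding switch.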

Just switching to utf-8 as a standard assumption for "box labels"
(e.g., email headers and file names etc.) would IMO go a long way
towards avoiding XML for email bodies. Thus

From: Alan Kennedy <alanmk at hotmail.com>
Organization: just me.
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
would be encoded in utf-8, but after the blank line that ends the header,
it would be assumed text/plain; charset=iso-8859-1.
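
A minimal sketch of that reading rule, assuming LF line endings and no folded
header lines (both simplifications); split_message is a hypothetical helper,
not anything an existing mail library provides:

```python
def split_message(raw):
    """Decode the header block as UTF-8 and the body per its declared charset.

    Assumes LF line endings and unfolded headers; a sketch, not a parser.
    """
    head, _, body = raw.partition(b"\n\n")
    headers = {}
    # The "box label" is always read as UTF-8, whatever the body uses.
    for line in head.decode("utf-8").split("\n"):
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    # Fall back to us-ascii when the header names no charset.
    charset = "us-ascii"
    for param in headers.get("content-type", "").split(";")[1:]:
        key, _, val = param.partition("=")
        if key.strip().lower() == "charset":
            charset = val.strip().strip('"')
    return headers, body.decode(charset)
```

The body bytes are never touched until the header has said how to read them.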

I'm sure people like Martin have given this a lot more thought (not to mention
relevant implementation work ;-)  than I have, so I would not be surprised to
have important issues I've disregarded pointed out.

For special effects in emails, I could see borrowing the XML "processing instruction"
escape, of which <?xml ... ?> itself is an example. I.e., that syntax means invoke
the named processing, passing ... up to the ?> as arguments to the processing program.

I wouldn't want just any of these to take off automatically, though. Imagine, e.g.,

    import os
    os.system( '...nastiness...')

A smart email reader could hide <?xxx ...?> as a clickable xxx and let you right-click
to see the source, but an old reader would just show it as above.
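
Roughly what the smart reader's hiding step might look like, as a sketch: the
regular expression and the [name] label are my assumptions, and nothing found
is ever executed, only collected for a later, manual decision.

```python
import re

# Assumption: an instruction is "<?" + a word + whitespace + arguments + "?>".
PI_PATTERN = re.compile(r"<\?(\w+)\s+(.*?)\s*\?>", re.DOTALL)


def hide_instructions(text):
    """Replace each <?name args?> with a [name] label.

    Returns the display text plus a list of (name, args) pairs. The pairs
    are only recorded; running anything is left to an explicit user action.
    """
    found = []

    def label(match):
        found.append((match.group(1), match.group(2)))
        return "[%s]" % match.group(1)

    return PI_PATTERN.sub(label, text), found
```

Keeping the (name, args) pairs separate from the display text is what makes a
"view source" right-click cheap to offer.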

Embedded pictures could e.g. be specific to a base64-encoded (and *that* represented in the
current encoding, along with the entire <? ... ?>) gif, like
<?gif ...base64 stuff...?> Clicking the highlighted "gif" might pop up a picture
in a child window, etc.

<?svg ... ?> Might be interesting also. All these things could be designed to operate
on immediate data vs referred-to data, where the latter could refer to attachments or
urls or whatever.

This mechanism could also be used to solve the gignooskoo problem, something like
(here with immediate data):

<?utf8 &#947;&#943;&#947;&#957;&#969;&#963;&#954;&#969;?>

Where that whole thing is encoded within the current message in its current encoding
(which can't be violated by including actual utf-8 characters not in common)
but is seen by the smart email reader as invoking "utf8" processing. That might be one
that one would elect to have the reader do automatically.
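
A sketch of that "utf8" processing, assuming the immediate data is written as
numeric character references as in the example; expand_utf8 and its pattern are
hypothetical names of mine:

```python
import html
import re

# Assumption: the immediate data is everything between "<?utf8" and "?>".
UTF8_PI = re.compile(r"<\?utf8\s+(.*?)\s*\?>", re.DOTALL)


def expand_utf8(text):
    """Replace each <?utf8 ...?> with the characters its references name."""
    return UTF8_PI.sub(lambda m: html.unescape(m.group(1)), text)
```

The source form stays pure ascii, so it survives any 7-bit transport; only the
reader's display step produces the Greek letters.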

Again, for email readers that don't understand <?xxx ... ?> you just see it as above.

But converting email/news wholesale to an instance of XHTML, please no! ;-)

Bengt Richter
