[Expat-discuss] Encoding lower 32 characters

Paul Prescod paulp@ActiveState.com
Mon, 30 Apr 2001 15:08:12 -0700


Michael Wissner wrote:
> 
> ...
> 
> Since I find it hard to believe that certain US-ASCII characters were
> omitted from Unicode, my next guess is that the intent of the XML spec is to
> say that those special characters are not valid in an XML file; that a valid
> XML file should encode those characters using character references such as
> "" so that they don't appear literally in the file.

"Well-Formedness Constraint: Legal Character

Characters referred to using character references must match the
production for Char."

[2]  Char ::=  #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF] 
/*  any Unicode character, excluding the surrogate blocks, FFFE, and
FFFF. */ 
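
Taken together, the constraint and the production mean that in XML 1.0 a
character reference to a control character such as NAK (#x15) is no more
legal than the literal character. As a rough sketch of the Char production
in Python (the function names are made up for illustration; they are not
part of expat or any XML library):

```python
def is_xml_char(cp: int) -> bool:
    """Return True if code point cp matches production [2] Char."""
    return (
        cp in (0x9, 0xA, 0xD)           # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF         # excludes the other C0 controls
        or 0xE000 <= cp <= 0xFFFD       # skips surrogates, FFFE and FFFF
        or 0x10000 <= cp <= 0x10FFFF
    )


def first_illegal_char(text: str) -> int:
    """Return the index of the first character outside Char, or -1."""
    for i, ch in enumerate(text):
        if not is_xml_char(ord(ch)):
            return i
    return -1
```

Running first_illegal_char() over application data before serializing it
is one way to find out early whether any escaping is needed at all.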


> ... Is it a bug in the XML spec?  

Well, it is intentional, but you could argue that it is a wrong
intention. :)

> ... If it's
> correct, how can I transmit application data that contains these characters?
> Clearly I can create my own application-level escaping mechanism, but
> doesn't this defeat the purpose of having an application-independent
> standard like XML?

It defeats part of the purpose, but encoding "control characters" is
actually pretty rare. You could make the argument that "<", ">" and "&"
are XML's control characters, so the others would be redundant. If you
want to insert a NAK or ESC, I'd suggest <NAK/> or <ESC/> and so on.

You could even standardize your encoding for these characters. :)
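
Here is a minimal sketch of the element approach in Python, assuming the
application only needs NAK and ESC. The table and the function names are
hypothetical, not an existing standard; only xml.sax.saxutils.escape and
xml.etree.ElementTree come from the standard library.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical mapping; extend it with whatever controls your data uses.
CONTROL_NAMES = {0x15: "NAK", 0x1B: "ESC"}
CONTROL_CODES = {name: cp for cp, name in CONTROL_NAMES.items()}


def encode_text(text: str) -> str:
    """Turn mapped control characters into empty elements, escape the rest."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp in CONTROL_NAMES:
            out.append(f"<{CONTROL_NAMES[cp]}/>")
        elif cp < 0x20 and cp not in (0x9, 0xA, 0xD):
            # Still illegal in XML and not covered by the mapping.
            raise ValueError(f"no element defined for U+{cp:04X}")
        else:
            out.append(escape(ch))  # handles "<", ">" and "&"
    return "".join(out)


def decode_element(elem: ET.Element) -> str:
    """Rebuild the original string from text and control elements."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(chr(CONTROL_CODES[child.tag]))
        parts.append(child.tail or "")
    return "".join(parts)
```

For example, encode_text("a<b\x1bc") gives "a&lt;b<ESC/>c", and wrapping
that in a root element, parsing it with ET.fromstring() and passing the
result to decode_element() returns the original string.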
-- 
Take a recipe. Leave a recipe.  
Python Cookbook!  http://www.ActiveState.com/pythoncookbook