[XML-SIG] (Py)DOM: Character References

Carsten Oberscheid co@daisybytes.su.uunet.de
Fri, 19 Mar 1999 10:52:18 +0100


>
> * Carsten Oberscheid
> |
> | Ok, since charrefs encode only characters from the document's base
> | character set (Unicode for XML, ASCII for SGML -- is that right?)
>
> No.  XML uses Unicode, but since XML is SGML (an SGML application
> profile, to be correct), it follows that this isn't true.  And in fact
> SGML as a meta-language does not have a fixed document character set.
> In fact, the SGML declaration allows you to define your own character
> set in terms of well-known character sets.

All right, I should have said "SGML according to the standard declaration a.k.a. 
reference concrete syntax" ;^)

>
> So, SGML can use Unicode/ISO 10646, as for example HTML 4.0 does[1],
> but it can also use any other character set which consists of
> well-known characters. It also has standard ways of handling
> characters that are not in the character sets. However, I don't think
> it can handle every character encoding, but I might be wrong.

But that leads me back to my original train of thought. Suppose I'm processing 
SGML/XML/HTMLx.x documents on a system that can't cope with the documents' full 
character set, e.g. it can display ASCII only. Since the source and the target 
systems are not limited that way, I don't want to restrict the character set 
itself. I just want, in my intermediate processing, to consistently represent 
the non-ASCII characters as character references.
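
To make that concrete, here is a minimal sketch of the translation I have in 
mind. It assumes a present-day Python 3, where the standard codec error handler 
"xmlcharrefreplace" is available; the function name to_charrefs is made up for 
illustration and is not part of PyDOM or any parser.

```python
def to_charrefs(text: str) -> str:
    """Return text with every non-ASCII character replaced by a
    decimal character reference (&#NNN;).

    ASCII characters pass through unchanged, so this is only meant
    for character data that has already been escaped for markup.
    """
    # Encoding to ASCII with this error handler substitutes &#NNN;
    # for each character that ASCII cannot represent.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_charrefs("Grüße"))
```

The result is plain ASCII that any display can show, while a conforming parser 
on the target system would resolve the references back to the original 
characters.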

As far as I can see from my zen level (I'm down hee-eeere!!), the DOM doesn't 
know about charrefs, and PyDOM expects them to be resolved (which xmlproc, for 
example, silently does). All I can do is to tell the XML lineariser to 
translate certain characters back to charrefs on output. But as I type this 
(learning by chatting away, hope you don't mind...) I see that this should be 
ok, since, to be XML (or SGML) conformant, my system (and the DOM 
implementation and the parser and so on) MUST be able to cope with the full 
charset internally.
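
The round trip described above might look roughly like this. It is only a 
sketch: it uses xml.dom.minidom from the present-day Python standard library as 
a stand-in for PyDOM and xmlproc, and the sample document is invented.

```python
from xml.dom import minidom

# Input that uses character references for the non-ASCII characters.
source = "<p>caf&#233; &#8364;5</p>"

# The parser resolves the references; the DOM only ever sees characters.
doc = minidom.parseString(source)
text = doc.documentElement.firstChild.data

# Serialising gives the full-charset form of the element...
serialised = doc.documentElement.toxml()

# ...and the output step turns non-ASCII characters back into charrefs.
ascii_bytes = serialised.encode("ascii", "xmlcharrefreplace")

print(repr(text))
print(ascii_bytes)
```

The point is that the references exist only at the two edges (input and 
output); everything in between works on the resolved characters.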

Hope I got this right now in my small brain, and thanks for making me think 
about it again.

>
> --Lars M.

.co.

+------------------------------------------------------- daisy bytes! --------+
 Carsten Oberscheid
 co@daisybytes.su.uunet.de                        digital document processing
 http://www.pweb.de/daisybytes.su                     electronic publishing