[XML-SIG] sgmlop and html parsing

"Martin v. Löwis" martin at v.loewis.de
Wed Jan 14 14:03:35 EST 2004


Walter Dörwald wrote:
> Wouldn't it make sense to implement an SGMLParser that supports
> unicode?

No. In SGML, the SGML declaration defines the document encoding, e.g.

CHARSET

         BASESET
   "ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
         DESCSET
                     0   9   UNUSED
                     9   2     9
                    11   2   UNUSED
                    13   1    13
                    14  18   UNUSED
                    32  95    32
                   127   1   UNUSED

         BASESET
   "ISO Registration Number 100//CHARSET ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1"
         DESCSET
                   128  32   UNUSED
                   160  96   32

So to understand a character reference, you have to know the SGML
declaration. It is Unicode only if the declaration says

      CHARSET
          BASESET
              "ISO Registration Number 177//CHARSET
               ISO/IEC 10646-1:1993 UCS-4 with implementation
               level 3//ESC 2/5 2/15 4/6"
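
To illustrate why the declaration matters, here is a minimal Python sketch that reads the DESCSET lines of the first declaration quoted above as a lookup table. Each triple is taken to mean (first document character number, count, first base-set number or UNUSED). The names IRV, LATIN1_RIGHT, DESCSET and resolve are made up for this example and are not part of sgmlop or any SGML library; mapping the resulting base-set number to an actual character is a further step not shown here.

```python
# Hypothetical labels for the two BASESETs in the first declaration.
IRV = "ISO 646 IRV"
LATIN1_RIGHT = "ECMA-94 right part of Latin-1"

# (base set, first document number, count, first base-set number or None for UNUSED)
DESCSET = [
    (IRV, 0, 9, None),
    (IRV, 9, 2, 9),
    (IRV, 11, 2, None),
    (IRV, 13, 1, 13),
    (IRV, 14, 18, None),
    (IRV, 32, 95, 32),
    (IRV, 127, 1, None),
    (LATIN1_RIGHT, 128, 32, None),
    (LATIN1_RIGHT, 160, 96, 32),
]


def resolve(number):
    """Map a document character number to (base set, base-set number).

    Returns None if the declaration marks the number as UNUSED and
    raises LookupError if the declaration does not describe it at all.
    """
    for baseset, start, count, base in DESCSET:
        if start <= number < start + count:
            if base is None:
                return None
            return baseset, base + (number - start)
    raise LookupError("character number %d is not described" % number)


print(resolve(65))    # described via the ISO 646 base set
print(resolve(233))   # described via the Latin-1 right-part base set
print(resolve(128))   # UNUSED in this declaration
```

With this table, a character reference such as &#233; resolves to a position in the Latin-1 base set, while &#128; is not a usable character at all. A different declaration would give different answers for the same numbers.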


Regards,
Martin



