[DOC-SIG] What I don't like about SGML

Paul Prescod papresco@technologist.com
Tue, 18 Nov 1997 06:45:27 -0500


Guido van Rossum wrote:
> 
> First, while SGML may have been standardized in the swinging '80s, it
> definitely has its roots in the '70s -- it takes many years to become
> an international standard (look at C++!), and it started its life, as
> "GML", long before standardization started.  Undoubtedly some of the
> worse features in SGML were designed to be backwards compatible
> (again, very much like C++...).

I don't doubt that SGML has some backwards compatible features, but it
is *not* backwards compatible with GML. The backwards compatibility
features mostly exist for people who think that something like TIM is
the greatest thing in the world and want to remake SGML in its image. 

Anyhow, TeX, and thus TeXInfo and thus TIM, also have their "roots in the
70s." Big deal. As far as I'm concerned, Python has its roots in the 70s
too.
 
> 99.9% of the time, HTML is parsed by relatively simple handwritten
> parsers, not by generic SGML scanners.  There are lots of programs out
> there that have to parse HTML -- preprocessors, web browsers, web
> spiders, etc.  Why don't these just link to an existing SGML scanner?
> Because SGML scanners are *huge*.  They need to be big to scan generic
> SGML, which is a very complex language.  But most of this power isn't
> needed to scan HTML, so people roll their own parser.

That's true. That's why we should stick to an SGML subset. I propose XML
+ minimizations.
 
> But Berners-Lee made one mistake: he made HTML look a bit like SGML
> (which he had seen once or twice from a distance :-).  

Berners-Lee's only mistake is that he didn't research SGML enough before
making HTML, so he had a lot of trouble bringing it back into the SGML
fold later.

> Almost
> immediately HTML was targeted by the SGML lobby for full compliance.

This is not true. Dan Connolly was the first person to propose an SGML
DTD for HTML. He is hardly in the "SGML Lobby" (talk to him about it
sometime, he has plenty of complaints about SGML) and the SGMLization of
HTML happened long before the SGML lobby really even understood the web.
Tim *hired* Dan to work with W3C and complete the work. In other words,
SGML was always Tim's idea. It goes back at least as far as 1993.

http://www.w3.org/MarkUp/draft-ietf-iiir-html-01.txt

I don't know about you, but I don't recall there being much of a web to
"lobby" in 1993. Face it, Tim and Dan thought SGML was neat and they
implemented it. They have had a love/hate relationship ever since (as do
many people) but they have been moving towards SGML at every step (cf.
XML).

> Here's what was added; all of this made my parser much more
> complicated than I think it ought to be (look at how complicated
> sgmllib.py is).  Note that most of what was added doesn't add
> functionality.  In one or two cases it even takes away functionality!

I feel that there is an important point you are missing. SGML offers
lots of extra functionality beyond what HTML takes advantage of. If the
browser vendors (esp. Netscape) had not been explicitly SGML-hostile
(sound familiar?), the web would be much further ahead. But they have
fought tooth and nail to keep the useful features out.

> It just complicates the scanning process in order to be compatible
> with the extremely complicated scanning rules designed for SGML on
> punched cards in the 70s.

I don't know where you get this "punch card" stuff. GML was invented at
about the same time as C and UNIX, and after Simula 67. Goldfarb
invented it to be part of an *interactive document database system*.
Anyhow, this 70s/90s thing is only interesting if we've learned a lot
about markup in the intervening 20 years. This doesn't seem to be the
case. TeXInfo, HTML and TIM really didn't introduce anything special
that SGML lacks. It seems the only thing we have learned since the
standardization of SGML is that some of its features are not as
important as we thought they would be. Fair enough -- let's not use them.
 
>     - A second special character '&' for entity references (original HTML
>     used <lt> to escape "<").

Big deal. Different markup for different things. Entity references can
go in attribute values and element content. They are NOT structural
sub-elements and should not be confused with them.

>     - Character references like &#32; or &#SPACE;.

How else are you going to include a Unicode character by number or name?
Are you going to claim that this isn't an "increase in functionality?"
If you need to input a Greek character you might disagree.
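
For anyone who wants to try it, both kinds of reference are easy to poke
at from Python. This is only a minimal sketch: it assumes a modern Python
whose standard library provides html.unescape (my assumption, not
something mentioned above), and it uses HTML's entity names rather than
SGML function names such as &#SPACE;.

```python
import html

# Numeric character reference: code point 945 is Greek small letter alpha.
by_number = html.unescape("&#945;")

# Named entity reference taken from the HTML entity set.
by_name = html.unescape("&alpha;")

# &#32; is simply a space written by number.
space = html.unescape("a&#32;b")

print(by_number, by_name, repr(space))
```

Both spellings produce the same character, which is the whole point:
the author picks whichever is easier to type.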
 
>     - Comments in the form of <!--.....-->, truly the most atrocious
>     comment convention invented (and I believe it's worse -- officially,
>     "--" may not occur inside a comment but "-- --" may, or something like
>     that; but who cares, as almost no handwritten parser seems to get this
>     right).

Comments could be simpler and smaller, but it really doesn't seem like a
big deal to me.

>     - Special stuff to be ignored, starting with <!...>, where it is
>     tricky to determine what the end is (since sometimes "<" or ">" may
>     occur inside.

"<" or ">" can only occur inside *in quotes*. This is like complaining
that the following Python statement is confusing because of the two
colons:

if a=="j:b":

Big deal -- string literal context is different from program context (or
markup context, in SGML).
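
To show how little code the quoting rule costs a handwritten scanner,
here is a rough sketch. The function name find_declaration_end is made
up for illustration, and the scanner is deliberately simplified: it
tracks only quoted literals and ignores the "--" comments and "[...]"
subsets that a real SGML declaration may also contain.

```python
def find_declaration_end(text, start=0):
    """Return the index of the '>' that closes the declaration at `start`.

    Simplified sketch: only quoted literals are tracked; '--' comments
    and '[...]' internal subsets are not handled.
    """
    if not text.startswith("<!", start):
        raise ValueError("no markup declaration at this position")
    quote = None  # the quote character we are currently inside, if any
    for i in range(start + 2, len(text)):
        ch = text[i]
        if quote:
            if ch == quote:
                quote = None  # literal closed, back in markup context
        elif ch in "\"'":
            quote = ch  # literal opened, '<' and '>' are plain data now
        elif ch == ">":
            return i
    raise ValueError("unterminated declaration")


decl = '<!ENTITY arrow "a > b < c"> trailing text'
end = find_declaration_end(decl)
print(decl[: end + 1])
```

The only state needed is "am I inside a literal, and which quote opened
it", exactly as in a tokenizer for Python string literals.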

>     - Special stuff to be ignored, starting with <?...>.

What's so hard or complicated about that?
 
>     - Short tags, <word/.../, which are still mostly outlawed because of
>     compatibility reasons with older HTML processors, but which have to be
>     recognized if you want to clame the elusive "full compliance".

Obviously sgmllib.py will never have full SGML compliance. Presumably
the reason you implemented those short cuts is actually because they are
useful and convenient.

I feel that your negative feelings about a particular process have
spilled over onto SGML. If the browser vendors had done their job
correctly in the first place, these short cuts would be allowed, would
always have been allowed, and would be usable today. You can hardly
blame their SGML-noncompliance on SGML! I might as well blame a
particular Unix's POSIX incompatibilities on Unix!

>     - It is not possible to turn off processing completely.  There used to
>     be an HTML tag <LISTING> (?) which switched to literal copying of the
>     text until </LISTING> was found.  This is impossible to do in SGML --
>     the best you can do is to switch to literal mode until </ followed by

That is not true. The *DTD* cannot turn off processing completely as
with the LISTING tag. The *author* can turn off processing completely
with a marked section:

<![CDATA[
<<<<>>>>><<<<<>>>>>&&&&&&
]]>

The end of the marked section is indicated by "]]>". But this is going
to be VERY rarely required in Python documentation. The only Python code
that has a </ in it is code talking explicitly about SGML. So once in
every 30 listings, you'll have to use the syntax above. Note that this
syntax is one of the things that the HTML browsers have neglected to
implement, although it is VERY important as you point out. Don't blame
SGML, blame them.
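
A parser that does implement marked sections hands the content back
untouched. As a quick illustration, here is a sketch that uses Python's
xml.etree.ElementTree as a stand-in for an SGML parser (XML keeps the
same CDATA syntax). The <listing> wrapper element is hypothetical; it is
only there because a document needs a root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical <listing> element wrapping the CDATA sample shown earlier.
doc = "<listing><![CDATA[<<<<>>>>><<<<<>>>>>&&&&&&]]></listing>"

listing = ET.fromstring(doc)

# The text comes back verbatim; nothing inside the marked section was
# treated as a tag or an entity reference.
print(listing.text)
```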

>     a letter is seen, and you can't turn off &ref; processing either.

That isn't true. You can turn that type of processing off using either a
CDATA content element or a CDATA marked section.

>     - Why do I have to put quotes around the URL in <A
>     HREF="http://www.python.org"> ???

Attribute values are string literals, just like in Python. You put them
in quotes to differentiate them from the surrounding whitespace, markup
delimiters, etc.
 
>     - Other restrictions on what you can do with attributes; apparently
>     there's a semantic rule that says that if two unrelated tags have an
>     attribute with the same name, it must have the same "type".

That isn't true.

>     - A content model, which nobody asked for, and which few people check
>     for, but which still allows HTML purists to tell you that your HTML
>     page is "non-conformant" when you place an <H4> heading inside a <LI>
>     list item (okay, so I made that up).

I must admit, I'm shocked to hear you say that. It was exactly *for* the
content model that Tim Berners-Lee and Dan Connolly moved HTML to be an
SGML document type. Please tell me what Grail should do with this
document:

<HTML>
<H1>Here's a rather STRANGE HTMLish DOCUMENT</H1>
<TITLE>This is a title</TITLE>
<TITLE>This is another title</TITLE>
<TITLE>This is a third</TITLE>
<TITLE>Strange to have so many!</TITLE>
<TITLE>But without a content model</TITLE>
<TITLE>This is perfectly legal</TITLE>

<TABLE><LI><TD><TR>Here's a rather odd table</TD></TR></LI>
<P>Curiouser and Curiouser
</TABLE>
</HTML>

Without the concept of a content model, this is a perfectly legal
document, and Grail would have to handle it and do something reasonable
with it (what's the title of this document? what does the table
structure look like?) Without DTDs and content models, you have no basis
for an information system. The fact that HTML authors ignore SGML rules
is a sad commentary on the Web, not on HTML. Those who are building the
web today -- browser vendors and standardizers alike -- have asked that
XML be extra strict because they recognize that the current HTML
situation is a mess *in spite of* SGML's strictures (and *because of*
widespread SGML ignorance).
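
To make "content model" concrete, here is a toy checker. Everything in
it is an assumption made for illustration: the CONTENT_MODELS table is
not the real HTML DTD, the check function is hypothetical, and a real
SGML parser does considerably more (omitted tags, inclusions,
exclusions). Each model is a regular expression over the names of an
element's direct children, and elements with no entry go unchecked.

```python
import re

# Hypothetical content models: each element maps to a regular expression
# over the space-terminated names of its direct children.
CONTENT_MODELS = {
    "html": r"head body ",
    "head": r"title ",            # exactly one title
    "body": r"((h1|p|table) )*",
    "table": r"(tr )+",
    "tr": r"(td )+",
}


def check(name, children):
    """Return a list of complaints for one element and its subtree.

    `children` is a list of (name, children) pairs; text is ignored.
    """
    problems = []
    model = CONTENT_MODELS.get(name)
    sequence = "".join(child + " " for child, _ in children)
    if model is not None and not re.fullmatch(model, sequence):
        problems.append(
            f"<{name}> children [{sequence.strip()}] do not match its content model"
        )
    for child, grandchildren in children:
        problems.extend(check(child, grandchildren))
    return problems


# A cut-down cousin of the strange document above: two titles, and a
# list item directly inside a table.
strange = ("html", [
    ("head", [("title", []), ("title", [])]),
    ("body", [("table", [("li", [])])]),
])
problems = check(*strange)
for problem in problems:
    print(problem)
```

With the table in place, the duplicate title and the stray list item are
both rejected mechanically, so the browser never has to guess what they
mean.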

If you think it is reasonable to put H4s in LIs, then talk to Dan
Connolly. He can make it possible (in consultation with W3C members). If
you want to make it possible to put ANY element in ANY other element, he
could make that possible too. SGML can allow anything anywhere just like
TIM or LaTeX. But he wouldn't -- he knows that constraints on element
occurrences are crucial. Removing them would be akin to asking Python
parsers to handle any random combination of operators and delimiters:

if ( def a(): class b(): pass )

>     - Probably a few other things that nobody asked for, such as the
>     DTD declaration and SGML's approach to character sets (which is
>     probably broken -- I believe there is a way to switch character
>     sets in mid-stream...).

The DTD is an important part of the documentation for HTML and also an
important implementation tool for many vendors. I don't know what your
problem is with it.

I don't know that SGML's approach to character sets is broken. Could you
be more specific? And perhaps you could describe how TIM's "approach to
character sets" is superior.
 
> So my claim remains that the requirement of SGML conformance is for
> 99% just a nuisance for parser writers.  Of course I'm biased, since
> I'm a parser writer myself...  So see for yourself what you think of
> this argument.

Of course compliance with any standard is a nuisance. It is always
easier to hack up what you need as you go along. Because of powerful
anti-SGML politics, HTML never took advantage of much of SGML's power.
For instance one of SGML's most basic facilities is the ability to reuse
content in the same document or across documents. But HTML can't do it.
Blame the browser vendors.
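
For readers who haven't seen content reuse in action, here is a small
sketch. It uses XML via Python's xml.etree.ElementTree as a stand-in for
SGML, and the entity name "notice", its text, and the <doc>/<p> elements
are all made up for the example. The text is declared once as an entity
and then referenced wherever it is needed.

```python
import xml.etree.ElementTree as ET

# The entity is declared once in the document's internal DTD subset
# and referenced twice in the content.
doc = """<!DOCTYPE doc [
  <!ENTITY notice "Copyright 1997, Example Author">
]>
<doc><p>&notice;</p><p>&notice;</p></doc>"""

root = ET.fromstring(doc)
texts = [p.text for p in root]
print(texts)
```

Change the declaration and every reference follows, which is the kind of
reuse that plain HTML gives authors no way to express.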

Most of the points in your flame seem to me to be more of an indictment
of anti-SGML bias than of SGML itself. It is as if someone tried out the
famed POSIX compatibility mode in NT and then claimed that Unix was
broken based on it. Obviously that environment is not a true reflection
of Unix itself, because its creators were not trying to allow access to
the power of Unix. HTML was supposed to allow access to the power of
SGML, but then Marc took over the web and forward progress ground to a
halt in favour of <BLINK> and <CENTER>.

 Paul Prescod



_______________
DOC-SIG  - SIG for the Python Documentation Project

send messages to: doc-sig@python.org
administrivia to: doc-sig-request@python.org
_______________