[DOC-SIG] What I don't like about SGML
Guido van Rossum
guido@CNRI.Reston.Va.US
Sun, 16 Nov 1997 10:54:00 -0500
Here's the background of my dislike for SGML. To confine this
highly flammable material :-), I'm spawning another thread.
First, while SGML may have been standardized in the swinging '80s, it
definitely has its roots in the '70s -- it takes many years to become
an international standard (look at C++!), and it started its life, as
"GML", long before standardization started. Undoubtedly some of the
worst features in SGML were designed to be backwards compatible
(again, very much like C++...).
I am well aware that HTML has been SGML conformant since HTML 2.0, and
this is precisely the reason for my concern.
99.9% of the time, HTML is parsed by relatively simple handwritten
parsers, not by generic SGML scanners. There are lots of programs out
there that have to parse HTML -- preprocessors, web browsers, web
spiders, etc. Why don't these just link to an existing SGML scanner?
Because SGML scanners are *huge*. They need to be big to scan generic
SGML, which is a very complex language. But most of this power isn't
needed to scan HTML, so people roll their own parser.
Before HTML had a version number, I wrote an HTML scanner in Python.
It was very simple. Look for < or </ followed by a letter, then scan
up to a > character, etc. HTML was simple to scan by design: Tim
Berners-Lee wanted HTML and HTTP to be so simple that almost anybody
could write programs that would immediately interoperate with the rest
of the web as it then existed. There is no doubt that this is the
reason that the web took off at all.
But Berners-Lee made one mistake: he made HTML look a bit like SGML
(which he had seen once or twice from a distance :-). Almost
immediately HTML was targeted by the SGML lobby for full compliance.
Here's what was added; all of this made my parser much more
complicated than I think it ought to be (look at how complicated
sgmllib.py is). Note that most of what was added doesn't add
functionality. In one or two cases it even takes away functionality!
It just complicates the scanning process in order to be compatible
with the extremely complicated scanning rules designed for SGML on
punched cards in the 70s.
- A second special character '&' for entity references (original HTML
used <lt> to escape "<").
- Character references like &#32; or &#SPACE;.
- Comments in the form of <!--.....-->, truly the most atrocious
comment convention invented (and I believe it's worse -- officially,
"--" may not occur inside a comment but "-- --" may, or something like
that; but who cares, as almost no handwritten parser seems to get this
right).
- Special stuff to be ignored, starting with <!...>, where it is
tricky to determine what the end is (since sometimes "<" or ">" may
occur inside).
- Special stuff to be ignored, starting with <?...>.
- Short tags, <word/.../, which are still mostly outlawed for
compatibility with older HTML processors, but which have to be
recognized if you want to claim the elusive "full compliance".
- It is not possible to turn off processing completely. There used to
be an HTML tag <LISTING> (?) which switched to literal copying of the
text until </LISTING> was found. This is impossible to do in SGML --
the best you can do is to switch to literal mode until </ followed by
a letter is seen, and you can't turn off &ref; processing either.
Of course, with a handwritten parser it is no problem to switch to a
mode that scans for </LISTING> exclusively...
- Why do I have to put quotes around the URL in <A
HREF="http://www.python.org"> ???
- Other restrictions on what you can do with attributes; apparently
there's a semantic rule that says that if two unrelated tags have an
attribute with the same name, it must have the same "type".
- A content model, which nobody asked for, and which few people check
for, but which still allows HTML purists to tell you that your HTML
page is "non-conformant" when you place an <H4> heading inside a <LI>
list item (okay, so I made that up).
- Probably a few other things that nobody asked for, such as the
DTD declaration and SGML's approach to character sets (which is
probably broken -- I believe there is a way to switch character
sets in mid-stream...).
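To illustrate the <LISTING> item in the list above, here is a minimal
sketch in present-day Python of a handwritten scanner that switches to
a literal mode and looks for </LISTING> exclusively. The function name
split_listing is hypothetical, and treating an unterminated <LISTING>
as running to the end of the input is an assumption, not something the
old HTML rules are known to require.

```python
import re

# Hypothetical sketch: the only two tags this splitter knows about.
OPEN = re.compile(r"<LISTING>", re.IGNORECASE)
CLOSE = re.compile(r"</LISTING>", re.IGNORECASE)


def split_listing(html):
    """Yield ('markup', s) and ('literal', s) chunks in document order.

    Inside a literal chunk nothing is interpreted: other tags, "</"
    followed by a letter, and "&ref;" sequences are copied as-is.
    """
    pos = 0
    while True:
        m = OPEN.search(html, pos)
        if m is None:
            if pos < len(html):
                yield ("markup", html[pos:])
            return
        if m.start() > pos:
            yield ("markup", html[pos:m.start()])
        end = CLOSE.search(html, m.end())
        # Assumption: an unterminated listing runs to the end of input.
        stop = end.start() if end else len(html)
        yield ("literal", html[m.end():stop])
        pos = end.end() if end else len(html)
```

The "markup" chunks would then go through the normal tag and entity
scanner, while the "literal" chunks are copied to the output untouched.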
Of course, SGML aficionados will claim that all this was necessary so
that HTML could be processed with SGML, the most powerful and flexible
text processing mechanism available. However, 99% of all HTML written
will never be processed by SGML; it is intended for throw-away
content. Serious SGML users have two other recourses available to
them:
(1) Write everything in SGML and generate HTML from that; I believe
Jade can do this.
(2) Write a simple HTML scanner and convert it to SGML, by hook or by
crook. I believe this is being done too.
So my claim remains that the requirement of SGML conformance is, 99%
of the time, just a nuisance for parser writers. Of course I'm biased,
since I'm a parser writer myself... So judge for yourself what you
think of this argument.
--Guido van Rossum (home page: http://www.python.org/~guido/)