[XML-SIG] Re: RSS and stuff

Lars Marius Garshol larsga@ifi.uio.no
02 Jun 1999 10:52:06 +0200


Hi Dan,

* Dan Libby
|
| I'm sure you'll be glad to hear that we are doing our validation
| with python and the excellent XML libraries you all have contributed
| to.

I certainly am glad to hear that! I'm also glad to see actual live
Netscape representatives in a public forum, since I've been wanting to
discuss RSS with you.

And although I say a lot of negative stuff about RSS below, I'd like to
congratulate you on the first successful global XML web application.
There are so many RSS documents on the web now, and quite a bit of
software, so I don't think there's any question that this honour
belongs to you. I also don't think there's any question that a major
part of the reason is that RSS is so simple.

I use my RSS client every day now and am very happy with it. I just
wish everyone whose pages I'm interested in would provide RSS feeds,
and I will probably start asking for it pretty soon.
 
| FYI, the current validator is very specific.  It understands the
| "0.9" format intimately at the code level.

This is definitely a good idea. Sadly, though, many of the RSS files
on the net are not even well-formed. The ones for WebMonkey and
python.org spring to mind.
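
A well-formedness check of this kind can be sketched in a few lines of
Python. This is only a minimal sketch using the expat parser from the
standard library; the helper name is_well_formed and the two sample
documents are made up for illustration and are not taken from any real
feed or from the Netscape validator.

```python
import xml.parsers.expat


def is_well_formed(data):
    """Return (True, None) if data parses as XML, else (False, message)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as err:
        return False, str(err)
    return True, None


# Hypothetical sample documents, for illustration only.
GOOD = (b'<?xml version="1.0"?>'
        b'<rdf:RDF><channel><title>Example</title></channel></rdf:RDF>')
BAD = b'<?xml version="1.0"?><channel><title>Unclosed</channel>'

if __name__ == "__main__":
    for name, doc in (("good", GOOD), ("bad", BAD)):
        ok, message = is_well_formed(doc)
        print(name, ok, message)
```

A check like this says nothing about whether the document is valid RSS,
only whether an XML parser can read it at all.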

| However, in my spare time I've been working on a generic validator
| that will read in a schema file (of my own devise, not a real XML
| schema) that's written in XML, and then validate a document based on
| that. 

Hmmm. Why not use a real XML schema? It should support everything I
can imagine you would want anyway. Or is it too complex?

| Hopefully I can get it installed soon, and possibly even distribute
| the source, such as it is.  (This is my first Python + first DOM
| coding project).  

It would be great if you did.

| This seems like a pretty obvious thing to me, I'm surprised that XML
| has gotten as far as it has without real support for enforcing data
| types, lengths, ranges, etc.

I can just hear the functional programming freaks (Standard ML,
Haskell and all that) say the same thing about Python. :-)

Seriously, these things aren't as important as many people think. And
it's also worth remembering that XML comes from a document background
where such things are not all that relevant. (Imagine trying to do
this for HTML. Actually enforcing correct use of DFN, H1-H6, ABBR,
ACRONYM, VAR, ADDRESS and all the other elements would require a
serious number of years of AI development in Prolog or Common Lisp.)

| What would you like to see / not see in the format?  It really is
| just supposed to be a summary. 

The first thing I'd like to see is a date element for items. Many RSS
providers currently use something like:

  <item>
    <title>(19990602) New foo!</title>
    <link>...

and it would be useful to formalize that as:

  <item>
    <date>19990602</date>
    <title>...
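
Until such an element exists, a client has to dig the date out of the
title itself. The following is a rough sketch of how that might look,
assuming the "(YYYYMMDD) " prefix convention shown above; the function
name split_dated_title is hypothetical, and real feeds may well use
other conventions.

```python
import re
from datetime import date

# Matches a leading "(YYYYMMDD)" followed by the rest of the title.
DATE_PREFIX = re.compile(r'^\((\d{4})(\d{2})(\d{2})\)\s*(.*)$')


def split_dated_title(title):
    """Split '(19990602) New foo!' into (date, 'New foo!').

    Returns (None, title) unchanged when there is no usable date prefix.
    """
    match = DATE_PREFIX.match(title)
    if match is None:
        return None, title
    year, month, day, rest = match.groups()
    try:
        return date(int(year), int(month), int(day)), rest
    except ValueError:
        # Eight digits that do not form a real calendar date.
        return None, title


if __name__ == "__main__":
    print(split_dated_title("(19990602) New foo!"))
    print(split_dated_title("No date here"))
```

With a separate date element none of this guessing would be needed.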

The second thing is descriptions for items. I'm thinking of providing
an RSS feed for my home page, and when I do I know I will want to be
able to have entries like:

  <item>
    <date>19990602</date>
    <title>RSS feed available!</title>
    <description>I now provide an RSS feed which lists all updates to
    my home page. This will hopefully make it easier for people

A third thing is a place to put the email address of the maintainer so
that I know where to complain when a document isn't well-formed.

There's probably more as well, which I'll think of the moment I send
this. If you want a discussion about what RSS should and shouldn't
contain, I'd recommend starting it here or over at xml-dev.
(I know Dave Winer has a lot of ideas for it.)

| Ideally, we would like to support all of Dublin Core eventually, but
| the problem is that the additional data may not actually be used,
| and marketing folks felt it would be simpler to not confuse folks
| too much.
 
I came to pretty much the same conclusion with XSA (see below) and
then discovered that the difficult stuff was needed anyway. But I
still think this is the right way to go:

  - make a simple version and put it out
  - wait for widespread acceptance and lots of implementations
  - then add all the difficult stuff and make it optional (In your
  case: why not make a CGI wizard like I did with XSA, and add a link
  from the RSS guide to the more fancy options?)

In any case, this isn't a new idea, since this is exactly what C, Unix
and C++ have done (to some extent also SAX and XML) and it seems to
work better than the opposite approach, favoured by many little-known
technologies (such as SGML).

* Lars Marius Garshol
|
| (warning: the RSS guide is not very accurate technically)

* Dan Libby
|
| What in particular did you find that was inaccurate?  

Here's a quick list:

  - The guide says: "Name your file using the .rdf suffix, unless you
  are generating your file dynamically using a .cgi or other
  program. Netscape recommends the use of the .rdf filename suffix,
  but does not require it."

  Well, on the web it's the MIME type that counts, so the guide should
  give the correct MIME type and then some hints on how to get it
  right. The suffix is just an ugly trick to get the right MIME type
  on correctly configured servers.

  - "RSS 0.9 supports the full ASCII character set, as well as all
  legal decimal and HTML entities. RSS 0.9 does not support other
  types of character data, such as UTF-8. For a list of legal HTML and
  decimal entities, refer to Special Symbols and Entities on DevEdge,
  Netscape's information resource for developers."

  Well, XML uses Unicode, but I suppose applications can be more
  restrictive. However, you cannot use HTML entities in XML without
  declaring them, and since there is no RSS DTD any RSS file that uses
  an HTML entity is not well-formed.

  - '<?xml version="1.0"?>'

  If you use US-ASCII you might as well declare that you're doing so
  with an encoding declaration. (Parsers may then complain if you
  don't conform to your own declaration.)

  - Also, what's the relationship with RDF? RSS uses the RDF root
  element, but does not conform to the RDF syntax or actually use
  anything meaningful from RDF.
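
The entity and encoding points in the list above are easy to try out.
Here is a small sketch using the expat parser from the Python standard
library; the helper parse_error and the sample documents are made up
for illustration, and other parsers may word their errors differently
or be more lenient about the encoding declaration.

```python
import xml.parsers.expat


def parse_error(data):
    """Return the parser's error message for data, or None if it parses."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as err:
        return str(err)
    return None


# An HTML entity that is not declared anywhere: not well-formed.
HTML_ENTITY = b'<?xml version="1.0"?><title>Caf&eacute; news</title>'

# The same character as a decimal character reference: fine.
CHAR_REF = b'<?xml version="1.0"?><title>Caf&#233; news</title>'

# Declares US-ASCII but contains the Latin-1 byte 0xE9, so the document
# does not conform to its own encoding declaration.
BAD_ASCII = (b'<?xml version="1.0" encoding="US-ASCII"?>'
             b'<title>Caf\xe9 news</title>')

if __name__ == "__main__":
    for name, doc in (("html entity", HTML_ENTITY),
                      ("character reference", CHAR_REF),
                      ("non-ASCII byte", BAD_ASCII)):
        print(name, "->", parse_error(doc))
```

The five predefined entities (amp, lt, gt, quot and apos) are the only
ones that can be used without a declaration.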

| This brings me to another question.  Do you all believe it is the
| "right thing" to publish a DTD for a format, even if the DTD by
| itself is not sufficient to validate the document?

Yes! A DTD is useful in that it allows you to do at least some
validation, and it's also very useful as a statement of intent (that
is, as documentation). For example, when reading the RSS guide it's
impossible to tell whether one or more textinput elements are allowed
and where they are allowed. The same goes for the image element.

This is the RSS DTD I currently have in my CVS tree. However, I have
no idea whether it's correct or not. For example, I've seen userland.com
use the image element as a special kind of item, so maybe the rdf:RDF
element should have (channel, (image | item)+, textinput?).

<!--

  A tentative DTD for RSS 0.9.

  This DTD has no official standing; it's just reverse-engineered from
  the RSS guide published by Netscape.
    
  Lars Marius Garshol - larsga@ifi.uio.no.
  $Id: rss-0.9.dtd,v 1.1 1999/05/24 20:54:04 larsga Exp $
  
-->

<!ELEMENT rdf:RDF (channel, image?, item+, textinput?)>
<!ATTLIST rdf:RDF
          xmlns:rdf CDATA #FIXED "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
          xmlns     CDATA #FIXED "http://my.netscape.com/rdf/simple/0.9/">

<!ELEMENT channel (title, description, link)>

<!ELEMENT title (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT link (#PCDATA)>


<!ELEMENT image (title, url, link)>

<!ELEMENT url (#PCDATA)>


<!ELEMENT item (title, link)>


<!ELEMENT textinput (title, description, name, link)>

<!ELEMENT name (#PCDATA)>
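
For tools that don't run a validating parser, the content model of the
root element in the DTD above can be checked by hand. The following is
only a sketch: it covers the rdf:RDF declaration and nothing else, it
is not a DTD validator, and the names ROOT_MODEL, local_name and
root_matches_model as well as the sample document are assumptions made
for illustration.

```python
import re
import xml.etree.ElementTree as ET

# (channel, image?, item+, textinput?) written as a regular expression
# over the child element names, each name followed by one space.
ROOT_MODEL = re.compile(r'channel (image )?(item )+(textinput )?')


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit('}', 1)[-1]


def root_matches_model(data):
    """Check the order and number of the root element's children."""
    root = ET.fromstring(data)
    children = ''.join(local_name(child.tag) + ' ' for child in root)
    return ROOT_MODEL.fullmatch(children) is not None


# Hypothetical sample document, for illustration only.
SAMPLE = '''<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://my.netscape.com/rdf/simple/0.9/">
  <channel>
    <title>Example</title>
    <description>An example channel</description>
    <link>http://www.example.com/</link>
  </channel>
  <item>
    <title>First item</title>
    <link>http://www.example.com/1</link>
  </item>
</rdf:RDF>'''

if __name__ == "__main__":
    print(root_matches_model(SAMPLE))
```

Changing the regular expression to match (channel, (image | item)+,
textinput?) would be a one-line edit, which is part of why writing the
model down explicitly is useful.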


| In other words, an XML editor application referencing the DTD would
| allow the user to construct a document that is non-valid with
| regards to our rules.  It seems to me that the DTD then becomes
| something of a distraction, because compliance with it, by itself,
| is not much more useful than well-formedness, from a validation
| point of view.

It's useful in that it provides more information for content providers
and software developers, and in that it's 100% unambiguous. It's also
useful for you when doing validation with custom-written tools, since
you won't have to worry about where elements occur.

I've done exactly the same for XSA and have exactly the same problem
as you. I provided a DTD and have special validating software that
rides on top of a validator (xmlproc). If I were to do it again
there's no question that I would do the same thing. So far there has
been no confusion at all (although I've seen HTML users become
confused by this).

See <URL: http://birk105.studby.uio.no/www_work/xsa/> for more info.

--Lars M.