[David Priest]
I'm feeling that I've perhaps pissed in someone's cornflakes, but I'm going to respond anyways. If I've offended, I apologize profusely: no offense was intended.
No offense perceived or taken. We're just debating the technical issues; nothing personal. Sorry if it seemed too blunt, but email tends to look that way. I don't have the time to be super-friendly or the inclination to sprinkle smilies throughout. When discussing issues via email, one needs a thick skin. Assume that the writer is smiling continuously, trying to help, which I was and am.

[David Goodger]
XML character entities (&#xNNNN;) are unknown to reStructuredText and are the wrong way to do it. Docutils is correct in substituting "&amp;" in HTML output for every "&" in the input file (how could it tell the difference between *using* an XML character entity and just *talking* about one?).
The charent pattern can be detected easily enough, and the "&" encoding skipped for those entities. If you want to talk about an entity directly, literalizing it would do the trick. In all other cases the ampersand can be safely encoded.
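The proposal above can be sketched roughly as follows. This is only an illustration, not anything Docutils does: the function name `escape_ampersands` and the regex are assumptions made for the example. An "&" is left alone when it starts something shaped like a character entity and is encoded as "&amp;" otherwise.

```python
import re

# Hypothetical pattern: match an "&" that does NOT begin something shaped like
# a character entity (&#xNNNN;, &#NNNN; or &name;).
_BARE_AMP = re.compile(
    r"&(?!(?:#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);)"
)


def escape_ampersands(text):
    """Encode bare ampersands as &amp; and leave entity-shaped text untouched."""
    return _BARE_AMP.sub("&amp;", text)


print(escape_ampersands("R&D uses &#xA9; and &copy; a & b"))
print(escape_ampersands("The '&amp;' entity represents '&'."))
```

Note that the second call leaves '&amp;' as it is, so a browser would display it as a plain "&". A purely pattern-based rule cannot tell text that uses an entity from text that merely mentions one.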
By literalizing do you mean ``inline literals`` or literal blocks? That's not always acceptable. I might want to say

    The '&amp;' entity is used by HTML and XML to represent the '&' character.

I shouldn't have to use inline literals here.

Docutils uses Unicode internally, and I don't see a need for it to grow a character entity subsystem. So far, you're the only one who has asked for one, and that's not convincing enough. I suspect that you may be asking Docutils to cover a deficiency in your toolset, or there's a misunderstanding. Please answer these questions from my last message to help clear this up:

- Why are you unable to insert the actual, encoded characters into the text?
- What *are* you able to insert?
- What encoding are your files using?
- What platform (OS, editor, etc.)?

It could be that you *can* insert real characters but don't know it.
But the substitution table using "proper" characters is good enough, although I'm not entirely sure that every backend's output will be able to deal with two-byte Unicode.
HTML can handle UTF-8. XML uses Unicode internally and assumes UTF-8 or UTF-16 unless told otherwise. As for back-ends, that's a Writer issue. If output format X can't handle Unicode, then the X format's Writer needs to encode those characters or signal an error. TeX can't handle the &#xNNNN; form.

Here's an alternative for you. If you want to use &whatever; XML character entities in your source, just put a simple filter into your tool chain that converts those entities into UTF-8. Something like::

    charents2utf8 input.txt | docutils/tools/html.py > output.html

There must be filters like that "out there". If not, it wouldn't be hard to write one, I think. A codec would do just as well, but Python doesn't come with such a codec (more's the pity).

I just realized the front-ends don't have support for explicit stdin/stdout with "-" arguments. I'll add that soon.

[re: interpreted text roles]
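On the character-entity filter suggested above: a minimal sketch of what such a filter might look like in Python. The names `charents_to_text` and `filter_stream` are made up for this example, and the source files are assumed to be UTF-8 (or plain ASCII). Named entities are looked up in the standard library's `html.entities.name2codepoint` table; anything unrecognized is passed through unchanged.

```python
import re
from html.entities import name2codepoint

# Matches &#xNNNN; (hex), &#NNNN; (decimal) and &name; forms.
_ENTITY = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def _decode(match):
    """Return the character for one entity, or the original text if unknown."""
    body = match.group(1)
    try:
        if body[:2] in ("#x", "#X"):
            return chr(int(body[2:], 16))
        if body[0] == "#":
            return chr(int(body[1:]))
        return chr(name2codepoint[body])
    except (KeyError, ValueError, OverflowError):
        # Unknown name or out-of-range code point: leave it as written.
        return match.group(0)


def charents_to_text(text):
    """Replace character entities in a string with the characters they name."""
    return _ENTITY.sub(_decode, text)


def filter_stream(src, dst, encoding="utf-8"):
    """Read bytes from src, decode entities, write UTF-8 bytes to dst."""
    text = src.read().decode(encoding)
    dst.write(charents_to_text(text).encode("utf-8"))
```

Used as a pipeline filter, this would be called as `filter_stream(sys.stdin.buffer, sys.stdout.buffer)`. Note that it also turns `&amp;` into a plain "&", which is fine here because the Writer encodes ampersands again on output.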
There's absolutely no need for any regex substitution here. That's exactly what the Writers are for. The ":gui:`File`" input may become "<gui>File</gui>" in the internal document tree. The HTML Writer would write it with bold (or, better yet, with <span class="gui">, made bold by the stylesheet). The DocBook Writer would write it with <guilabel>.
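To make the division of labour concrete, here is a toy sketch in plain Python. It does not use the Docutils API: `RoleNode`, `parse_role`, `write_html`, `write_docbook` and the `DOCBOOK_ELEMENTS` table are all hypothetical names invented for the illustration. The point is only that the parser records *what* the text is, and each Writer decides *how* to render it.

```python
import re
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class RoleNode:
    """Stand-in for a document-tree element such as <gui>File</gui>."""
    role: str
    text: str


# Hypothetical pattern for a single interpreted-text construct, e.g. :gui:`File`
_ROLE = re.compile(r":([A-Za-z][\w-]*):`([^`]+)`")

# Assumed mapping from role names to DocBook element names.
DOCBOOK_ELEMENTS = {"gui": "guilabel"}


def parse_role(source):
    """Turn ':role:`text`' into a RoleNode; no output format is involved."""
    match = _ROLE.fullmatch(source)
    if match is None:
        raise ValueError("not an interpreted-text role: %r" % source)
    return RoleNode(match.group(1), match.group(2))


def write_html(node):
    """HTML rendering: a classed span, styled by the stylesheet."""
    return '<span class="%s">%s</span>' % (node.role, escape(node.text))


def write_docbook(node):
    """DocBook rendering: a dedicated element if one is known, else <phrase>."""
    element = DOCBOOK_ELEMENTS.get(node.role)
    if element is None:
        return '<phrase role="%s">%s</phrase>' % (node.role, escape(node.text))
    return "<%s>%s</%s>" % (element, escape(node.text), element)


node = parse_role(":gui:`File`")
print(write_html(node))
print(write_docbook(node))
```

The same node feeds both functions, so adding an output format means adding a writer function, not touching the source text.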
The problem with implementing it as a Writer is that Writers don't travel with the source text files. If I send you a ReST file with :gui: roles in it, what's your DocUtils installation going to do with it?
The set of roles built into Docutils itself will grow. If the growth proves to be unlimited or unmanageable, there will have to be an alternative. If those roles are not handled by the default Docutils set, then they can be local to your installation. They wouldn't be portable, true; there's only so far a "standard" can go and we can't please everybody 100%. There's also this alternative:
There has been some discussion about parameterizing the interpreted text system somehow, to avoid proliferation of element types (gui, keypress, etc.). No decision or action yet.
See the Doc-SIG thread, "master plan for interpreted text?" from last month.
If "parameterizing the interpreted text system" means that simple role substitutions -- the kind that can be handled by regex -- can be placed within the source text files, great! It makes the source text more portable.
But I doubt it will take the form of "regex substitutions". That's just too low-level, IMHO.
If the role can't be handled by a regex, then of course it's going to require a Writer. (Although... if one could embed a Python script... but, no. That's verging on silly.)
Have you read PEP 256 & 258 yet? Please do. They explain the Docutils architecture and the purpose of the components (Writer, Reader, Parser, etc.).
And if this email has been scrunched into a blob, I apologize.
Try leaving a space on each blank line. I.e., [return] [space] [return]. I've done that just above.

--
David Goodger http://starship.python.net/~goodger
Programmer/sysadmin for hire: http://starship.python.net/~goodger/cv