The Cost of Dynamism
Thomas 'PointedEars' Lahn
PointedEars at web.de
Sun Mar 13 16:05:39 EDT 2016
Chris Angelico wrote:
> On Sun, Mar 13, 2016 at 6:24 AM, Thomas 'PointedEars' Lahn
> <PointedEars at web.de> wrote:
>> Marko Rauhamaa wrote:
>>> […] HTML markup is all ASCII.
>>
>> Wrong. I am creating HTML documents whose source code contains Unicode
>> characters every day.
>>
>> Also, the two of you fail to differentiate between US-ASCII, a 7-bit
>> character encoding, and 8-bit or longer encodings which can *also* encode
>> characters that can be *encoded with* US-ASCII.
>
> Where are the non-ASCII characters in your HTML documents? Are they in
> the *markup* of HTML, or in the *text*? This is the difference.
The misconception is on your part instead. The text content of an
HTML/Web document (the part between the [HTML] tags) is *part* of the (HTML)
markup as it is (at least) *a part* of the content of (HTML) elements. [1a]
[1b]
Besides, even if one were to unwisely adopt your private definition of
“markup”, Unicode characters that cannot be encoded with US-ASCII are of
course allowed verbatim in attribute values, and to a lesser degree (not in
HTML 4.01 and below) in element type names and attribute names, as well –
therefore, according to even your *wrong* private definition of “markup”,
“*in* the markup of HTML”. [2][3]
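A minimal Python sketch of the attribute-value case, assuming Python 3.7 or later and the standard library's html.parser module. The sample element, its attribute values and the AttrCollector class are made up for illustration; this only shows that one tolerant parser passes such values through unchanged, not what any specification requires.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records every (name, value) attribute pair seen."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)


# Illustrative start tag with a verbatim non-ASCII attribute value.
doc = '<p title="Grüße aus Zürich" lang="de">Text</p>'

collector = AttrCollector()
collector.feed(doc)
collector.close()

# Keep only the attributes whose value cannot be encoded with US-ASCII.
non_ascii = [
    (name, value)
    for name, value in collector.attrs
    if value is not None and not value.isascii()
]
print(non_ascii)
```

The title attribute is reported with its umlauts intact, i.e. the non-ASCII characters sit inside the start tag itself.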
Bottom line:
If one declares the character encoding that one uses in an SGML-based (HTML
up to and including version 4.01, XML and all XML-based document types) or
SGML-related (HTML5) markup document (there are several possibilities for that)¹,
there is no need to use character entity references instead of plain Unicode
characters. And if you avoid spaghetti code, the probability of the need
for numeric character references in HTML is also quite low. (The same
applies to lightweight markup languages like Markdown, but let us not get
there now.)
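As a rough illustration in Python (a sketch under stated assumptions, not a conformance test): the three sample documents below are invented, the encoding is declared with a meta element, and html.unescape() from the standard library stands in for the reference resolution that a real HTML parser would perform.

```python
import html

# Hypothetical sample documents; the encoding is declared via <meta charset>.
verbatim = '<!DOCTYPE html><meta charset="utf-8"><title>Café</title>'
entity_ref = '<!DOCTYPE html><meta charset="utf-8"><title>Caf&eacute;</title>'
numeric_ref = '<!DOCTYPE html><meta charset="utf-8"><title>Caf&#233;</title>'

# Resolving the references yields exactly the verbatim text.
same_text = html.unescape(entity_ref) == verbatim == html.unescape(numeric_ref)

# The verbatim form survives a round trip through the declared encoding ...
round_trip = verbatim.encode("utf-8").decode("utf-8") == verbatim

# ... but it cannot be encoded with US-ASCII, hence the declaration.
try:
    verbatim.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(same_text, round_trip, ascii_ok)
```

With the encoding declared, the references carry no information that the plain character does not already carry.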
[In fact, the possibility of using verbatim characters other than those that
can be encoded with US-ASCII applies to all Internet messages, including
e-mail and Usenet postings, and to a lesser degree (because there are fewer
declaration mechanisms available) to all forms of electronically
stored/readable text. As of RFC 5536, standards-compliant Network News
client software is even required to support MIME. [4]]
[This was a professional Web author/developer with more than a decade of
continuing work experience clarifying your misconception. I recommend
that you subscribe to the newsgroups in the
comp.infosystems.www.authoring.* hierarchy, where this discussion would
have been on-topic, and to <news:comp.lang.javascript>, to clarify some
of the other misconceptions that you may have acquired about
Web(-related) authoring/development.]
________
¹ This is only to be reasonably safe from surprises; several of those
markup languages require the assumption of a default character encoding
and/or the implementation of character encoding detection for their
parsers, but not all parsers are conforming, and it stands to reason
that parser efficiency can be increased if the encoding does not have
to be detected/inferred first.
[1a] <https://en.wikipedia.org/wiki/Markup_language#Etymology_and_origin>
[1b] <https://www.w3.org/TR/1999/REC-html401-19991224/intro/sgmltut.html#h-3.2.1>
     <http://www.w3.org/TR/2014/REC-html5-20141028/dom.html#elements>
[2]  <http://www.w3.org/TR/2014/REC-html5-20141028/infrastructure.html#encoding-terminology>
[3]  <https://www.w3.org/TR/1999/REC-html401-19991224/charset.html#doc-char-set>
     <http://www.w3.org/TR/2014/REC-html5-20141028/syntax.html#parsing>
[4]  <http://tools.ietf.org/html/rfc5536#section-2.3>
> And I'm not conflating those two. When I say ASCII, I am referring to
> the 128 characters that have Unicode codepoints U+0000 through U+007F.
That is only your private definition of ASCII. The commonly accepted
definition is along these lines instead:
<https://en.wikipedia.org/wiki/ASCII> etc.
(See also the Specification references above.)
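The difference between the encoding and a range of code points can be sketched in Python as follows. This is only an illustration; the sample string and the chosen codec names are my own assumptions, not taken from any of the references above.

```python
# Hypothetical sample text consisting only of US-ASCII-encodable characters.
text = "ASCII"

ascii_bytes = text.encode("ascii")

# US-ASCII is a 7-bit encoding: every code unit is below 0x80.
seven_bit = all(byte < 0x80 for byte in ascii_bytes)

# Several 8-bit or longer encodings produce the very same bytes for this text ...
compatible = all(
    text.encode(name) == ascii_bytes
    for name in ("utf-8", "iso-8859-1", "cp1252")
)

# ... while UTF-16 encodes the same code points (U+0041 etc.) differently.
utf16_bytes = text.encode("utf-16-le")
differs = utf16_bytes != ascii_bytes

# A byte with the eighth bit set is not US-ASCII at all.
try:
    bytes([0xE9]).decode("ascii")
    decodable = True
except UnicodeDecodeError:
    decodable = False

print(seven_bit, compatible, differs, decodable)
```

The same five code points thus yield different byte sequences depending on the encoding, which is why naming a code point range does not by itself name an encoding.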
HTH
--
PointedEars
Twitter: @PointedEars2
Please do not cc me.