Should HTML entity translation accept "&amp"?
steven at REMOVE.THIS.cybersource.com.au
Mon Jan 7 04:55:44 CET 2008
On Mon, 07 Jan 2008 12:25:07 +1100, Ben Finney wrote:
> John Nagle <nagle at animats.com> writes:
>> For our own purposes, I rewrote "htmldecode" to require a sequence
>> ending in ";", which means some bogus HTML escapes won't be recognized,
>> but correct HTML will be processed correctly. What's general opinion of
>> this behavior? Too strict, or OK?
> I think it's fine. In the face of ambiguity (and deviation from the
> published standards), refuse the temptation to guess.
That's good advice for a library function. But...
> More specifically, I don't see any reason to contort your code to
> understand some non-entity sequence that would be flagged as invalid by
> HTML validator tools.
... it is questionable advice for a program which is designed to make
sense of invalid HTML.
Like it or not, real-world applications sometimes have to work with bad
data. I think we can all agree that the world would have been better off
if the major browsers had followed your advice, but they did not, and so
websites with invalid HTML exist. That leaves John in the painful
position of having to write code that makes sense of invalid HTML.
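To make the trade-off concrete, here is a rough sketch of the two behaviours, assuming a current Python 3. The names strict_decode and lenient_decode are made up for illustration; this is not John's actual "htmldecode". The strict version only translates references that end in ";" and leaves everything else alone, while the lenient version leans on the standard library's html.unescape, which follows browser-style rules and accepts some references without the trailing ";".

```python
import html
import re
from html.entities import name2codepoint

# Only well-formed references that end in ";" are matched.
_STRICT_ENTITY = re.compile(
    r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);"
)


def strict_decode(text):
    """Translate only references terminated by ';' (hypothetical helper)."""

    def replace(match):
        body = match.group(1)
        if body.startswith("#"):
            # Numeric reference: decimal (&#65;) or hexadecimal (&#x41;).
            digits = body[1:]
            base = 10
            if digits[0] in "xX":
                digits = digits[1:]
                base = 16
            try:
                return chr(int(digits, base))
            except (ValueError, OverflowError):
                # Out-of-range code point: leave the text untouched.
                return match.group(0)
        # Named reference: unknown names are left untouched.
        codepoint = name2codepoint.get(body)
        return chr(codepoint) if codepoint is not None else match.group(0)

    return _STRICT_ENTITY.sub(replace, text)


def lenient_decode(text):
    """Browser-style decoding via the standard library (hypothetical wrapper)."""
    return html.unescape(text)


if __name__ == "__main__":
    sample = "AT&amp T and AT&amp;T"
    print(strict_decode(sample))
    print(lenient_decode(sample))
```

With the strict version the bogus "&amp T" passes through unchanged (a false negative); with the lenient version it is guessed to be "&". Which of those is the lesser evil depends on what the decoded text is used for.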
I think only John can really answer his own question. What are the
consequences of false positives versus false negatives? If it raises an
exception, can he shunt the code to another function and use some
heuristics to make sense of it, or is it "game over, another site can't