HTML != XML (was Re: [ANN] Kludgey workarounds for xt)

Tyler Baker tyler at
Thu Sep 10 00:18:48 BST 1998

Eddie Sheffield wrote:

> But it seems that the problem isn't the HTML, but rather with SCRIPTS that might
> be included in the HTML. I believe that HTML defines the <SCRIPT
> LANGUAGE="whatever">...</SCRIPT> tags, but NOT the actual script that lies within
> the tags. This is where the problem is. That script might be one of many
> languages (javascript, jscript, vbscript, ecmascript, etc.) and knowing exactly
> how to properly post-process the file would be VERY non-trivial, especially if
> the script itself has to generate HTML on the fly. For example:
> What I want:
> document.write("She said &quot;Run away!&quot;");
> but the generated code is:
> document.write(&quot;She said &quot;Run away!&quot;&quot;);
> Obviously a post-processor can't simply replace EVERY &quot; in the line, or the
> script becomes invalid. But how do you know which to replace and which not? I
> suppose you could parse the script and try replacing the ones that are necessary
> for the script to be valid, but then you would need separate processors/parsers
> for each type of script language that might be in the script.
> As much as possible, a workaround would be to use external scripts that are never
> processed at all, but are pointed to with the optional SRC attribute on the
> SCRIPT tag. This only works for scripts that don't have to be dynamically
> generated, though.
> It does seem odd that with the advent of the DOM, which really eases scripting and
> makes it much more powerful, problems occur almost simultaneously that make
> generating those scripts more difficult.
> Eddie

The approach I use in my XML Formatter is a boolean setting.  When it is on, entity
values that occur in character data and attribute values are automatically replaced
with entity names (this includes character entities); when it is off, nothing is
replaced.  Another alternative is to wrap any character data that has already been
processed for output and includes entity references in a special object that is
essentially a flag saying: do not process this text, or even normalize it.  This is
what I do now for CDATA sections, and it is pretty much the same technique the DOM
uses to distinguish between text that can be normalized and text that should not be.
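
A rough sketch of both ideas in Python, only to make them concrete.  This is not the
code of my formatter or of XT; the names Raw, Formatter, escape_entities and
format_text are all made up for illustration.  The only real library call is
xml.sax.saxutils.escape from the Python standard library.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape

# Hypothetical names throughout; nothing here is taken from a real formatter or XT.
# escape() always handles &, < and >; the quotes are added through this table.
QUOTES = {'"': "&quot;", "'": "&apos;"}


@dataclass(frozen=True)
class Raw:
    """Flag object: the wrapped text is written verbatim, never escaped."""
    text: str


class Formatter:
    def __init__(self, escape_entities: bool = True):
        # The boolean setting: replace entity values with entity names, or not.
        self.escape_entities = escape_entities

    def format_text(self, chunk) -> str:
        if isinstance(chunk, Raw):
            # Wrapped text (script bodies, CDATA-like content) passes through.
            return chunk.text
        if not self.escape_entities:
            return chunk
        return escape(chunk, QUOTES)

    def format_all(self, chunks) -> str:
        # Mix ordinary text and Raw-wrapped text in one output stream.
        return "".join(self.format_text(c) for c in chunks)


if __name__ == "__main__":
    script = 'if (a < b && s == "x") go();'
    f = Formatter()
    print(f.format_text(script))       # escaped, no longer a valid script
    print(f.format_text(Raw(script)))  # untouched
```

The wrapper is the finer-grained of the two: the boolean switches escaping off for the
whole output, while the wrapper lets ordinary text stay escaped and exempts only the
script text.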

Maybe XT should have something like:


which does not auto-replace instances of <, >, &, ", '.


