Do I need to use a validating parser?

Marcelo Cantos marcelo at
Wed May 5 05:28:41 BST 1999

On Tue, May 04, 1999 at 12:54:02PM -0400, Joshua E. Smith wrote:
> XML-Conformant.  Cool, I'll add that to the FAQ you wanted me to write. ;)
> > > If you were using a programming language which is XML-ish, what XML
> > > features would you be annoyed to see left out (substitution of entities is
> > > an obvious one, which I've seen 3DML slammed for)?  
> >
> >I find it very hard to imagine coding in a Turing-complete
> >programming language that is XML-ish -- markup languages are usually
> >quite clumsy for representing programming languages.
> >
> >What exactly do you mean, here?
> I really meant what I wrote.  I'm assuming that most programmers will not
> actually write in the markup language, but rather will use editors which
> produce markup as their output.  If you think about it, that's what's
> already happening with tools like Access or Delphi (users work in an
> editor, and for the most part, don't touch the code), and of course that's
> almost the only way anyone can deal with HTML anymore.

This is not true.  It is still extremely common for HTML authors to
code raw HTML in a text editor.  In fact, if any degree of
sophistication is required of your web site, a high level HTML editor
will, more than likely, just get in the way.

We gave up looking for a good HTML authoring environment because we
almost always found that it was easier just to knock it up in a text
editor.  Moreover, any site that contains a large number of dynamic
pages will not benefit at all from a fancy editor.

If people do find themselves depending more and more on smart editors,
this can hardly be used as an argument in _favour_ of HTML.  It would
be nonsensical to say we should use HTML _because_ there are ways to
cope with its arcana!

> So from that perspective, even if it was clumsy, it wouldn't really
> matter.
> The right question is: Can a program be represented as a tree?  And
> the answer is always yes.  For example, think about LISP.  What is
> that if not a tree structured language?  And XML is *GREAT* for
> representing trees.

What about C/C++ with their goto and switch constructs?  How would you
represent this as a tree?

    void f(int i)
    {
        switch (i) {
        case 1:
            if (foo())
                while (baz()) {
        case 2:
        case 3:
                    if (!foo())
                        goto aa;
                }
        }
    aa: ;
    }

Sure you can squeeze this into a tree, but it would be a mess.  For
instance, how would you introduce case 2 within the scope of the
switch statement (as opposed to the scope of while(baz()) or global
scope)?  Likewise, how would you introduce label aa at function scope?
XML simply cannot do this without implicit help from something like

> Now consider what happens to your favorite ALGOL-derived language (say,
> Java) when you compile it.  It gets formatted by YACC into a parse tree.
> So represent the parse tree in XML to begin with, and get rid of the
> compiler front end.

No it doesn't.  I believe yacc generates an LALR(1) parser, and that
parser never builds a parse tree of its own accord.  Rather, it invokes
user-defined code snippets when productions are matched.  (This is, of
course, not to say you can't build a parse tree within your own code.)

Furthermore, you are not removing the front end, but simply replacing
it.  And I am not at all convinced that this is a good thing.

> My language isn't anything like LISP or ALGOL, but I think this gets the
> point across.  It's pretty easy to write programming languages which are
> XML-Conformant.

I think a more interesting project would be a meta-language, like
Knuth's WEB.  Under this scheme, a program would be represented as an
XML document, and this would be used to generate the source code for a
compiler.  Hence the XML is not considered the final source, but is
instead translated through a 'formatter' into the language of choice.
With such an approach, the XML document would not represent the
complete parse tree, but would instead store snippets of target
language code to be emitted at appropriate points in the generation
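
A minimal sketch of such a formatter, in Python purely for brevity.
The element names (program, function, code) are invented for
illustration and belong to no real schema.  The only point is that the
XML holds snippets of target-language text and the formatter decides
where to emit them:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <program>, <function> and <code> are names
# made up for this sketch, not part of any existing meta-language.
DOC = """
<program>
  <function name="square" returns="int" params="int x">
    <code>return x * x;</code>
  </function>
  <function name="main" returns="int" params="void">
    <code>return square(3) == 9 ? 0 : 1;</code>
  </function>
</program>
"""

def format_program(xml_text):
    """Translate the XML document into C source text."""
    root = ET.fromstring(xml_text)
    out = []
    for fn in root.findall("function"):
        # The function header is assembled from attributes.
        out.append(f'{fn.get("returns")} {fn.get("name")}({fn.get("params")})')
        out.append("{")
        # The body is emitted verbatim from the stored snippets.
        for code in fn.findall("code"):
            out.append("    " + code.text.strip())
        out.append("}")
        out.append("")
    return "\n".join(out)

c_source = format_program(DOC)
print(c_source)
```

The formatter never parses the C snippets themselves, so it cannot
check that they are valid.  That job stays with the real compiler.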

One thing to note is that the meta-language may actually constrain the
compiler language constructs that can be output.  For instance, an
SGML representation of a switch statement might look like this:

    <switch expr="i">
        <case><int value="1"/>
            <if><cond><call f="baz"/></cond>
                <call f="bar"><var name="i"/></call>
            </if>
        </case>
    </switch>

This scheme does not have the capacity to produce my original C
example with its cases and label/goto.  Whether this is a good or a
bad thing is, I'm sure, a topic for lively debate.
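
To make the constraint concrete, here is a rough sketch of a generator
for this kind of markup, again in Python.  It works on a well-formed
variant of the example above.  The rule that a <case> may only appear
directly inside a <switch>, and the automatic "break", are assumptions
of this sketch rather than anything the scheme itself dictates:

```python
import xml.etree.ElementTree as ET

DOC = """
<switch expr="i">
  <case><int value="1"/>
    <if><cond><call f="baz"/></cond>
      <call f="bar"><var name="i"/></call>
    </if>
  </case>
</switch>
"""

def expr(e):
    """Render an expression element as C text."""
    if e.tag == "int":
        return e.get("value")
    if e.tag == "var":
        return e.get("name")
    if e.tag == "call":
        return f'{e.get("f")}({", ".join(expr(a) for a in e)})'
    raise ValueError(f"not an expression: <{e.tag}>")

def stmt(e, depth):
    """Render a statement element as a list of indented C lines."""
    pad = "    " * depth
    if e.tag == "call":
        return [pad + expr(e) + ";"]
    if e.tag == "if":
        cond, *body = list(e)
        lines = [f"{pad}if ({expr(cond[0])}) {{"]
        for s in body:
            lines += stmt(s, depth + 1)
        return lines + [pad + "}"]
    if e.tag == "switch":
        lines = [f'{pad}switch ({e.get("expr")}) {{']
        for case in e:
            if case.tag != "case":
                raise ValueError(f"expected <case>, got <{case.tag}>")
            label, *body = list(case)
            lines.append(f"{pad}case {expr(label)}:")
            for s in body:
                lines += stmt(s, depth + 1)
            lines.append(pad + "    break;")  # assumed: no fall-through
        return lines + [pad + "}"]
    # A <case> anywhere but directly under <switch> ends up here.
    raise ValueError(f"not a statement: <{e.tag}>")

print("\n".join(stmt(ET.fromstring(DOC), 0)))
```

A <case> nested inside an <if> is rejected with a ValueError, so the
interleaved cases of the C example simply have no representation.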

> My language doesn't use constructs like "if-then" or "for-next" so the user
> wouldn't be exposed to any nasty parse trees anyway; but even if it did, I
> don't know that
> <for var="i" from="0" to="100">
>   stuff
> </for>
> is all that clumsier than the non-tagged equivalent.

It also wouldn't work:

  for i = 100/f(j) to int(sqrt(k)) do

could not become:

  <for var="i" from="100/f(j)" to="int(sqrt(k))">stuff</for>

Instead, it would have to be:

    <for var="i">
        <from><div>
            <int value="100"/>
            <call f="f"><var n="j"/></call>
        </div></from>
        <to>
            <call f="int">
                <call f="sqrt"><var n="k"/></call>
            </call>
        </to>
        <do><!-- stuff --></do>
    </for>

Hence, even your simple example would be more like this:

    <for var="i"><from><int v="0"/></from><to><int v="100"/></to>
        <do>stuff</do></for>

Now, you could always reduce the size of these constructs a little by
removing some of the redundant elements, such as <from>, <to> and
<do>, but that would actually be a net loss in clarity, IMHO.
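
To see how quickly expressions fan out into elements, here is a rough
sketch that turns expression text into nested elements.  It borrows
Python's own expression parser (the ast module) as a stand-in for
whatever syntax the real language would have.  The <div> element for
division is an assumed name; int, var and call follow the examples
above:

```python
import ast
import xml.etree.ElementTree as ET

def to_xml(node):
    """Convert one Python AST expression node into an XML element."""
    if isinstance(node, ast.Constant) and isinstance(node.value, int):
        return ET.Element("int", value=str(node.value))
    if isinstance(node, ast.Name):
        return ET.Element("var", n=node.id)
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        el = ET.Element("call", f=node.func.id)
        el.extend([to_xml(arg) for arg in node.args])
        return el
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Div):
        el = ET.Element("div")  # assumed element name for division
        el.extend([to_xml(node.left), to_xml(node.right)])
        return el
    raise ValueError(f"unsupported expression: {ast.dump(node)}")

def expr_to_element(text):
    """Parse expression text and return the root XML element."""
    return to_xml(ast.parse(text, mode="eval").body)

for source in ("100/f(j)", "int(sqrt(k))"):
    print(ET.tostring(expr_to_element(source), encoding="unicode"))
```

Even these two short expressions come out as four and three elements
respectively, before any <from>, <to> or <do> wrappers are added.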

My primary objection to all of this, however, has nothing to do with
complexity or feasibility (it most certainly could be done).  The
right question is not _can_ a program be represented as a tree, but
_should_ it.  C++ compilers, for instance, all have numerous bugs, and
many of them have quite serious ones when it comes to combining
templates, destructors and exceptions.  But those problems have
nothing to do with the difficulty of parsing and everything to do
with the complexity of the C++ object model.  This is not an attempt
to defend C++ syntax, which is indeed something of a bugbear to
parse.  But the point to note is that this is _not_ the area where
compilers come to grief, and therefore it is not where efforts should
be focused.

So my question is, what do you gain?  How will my life be improved if
I have an XML-conformant programming language?  All I seem to have
gained is an extra layer of complexity (the custom editor) that I am
forced to deal with because trying to work directly with the
underlying code is the stuff of my worst nightmares.

I am not into climbing mountains just because they're there.



xml-dev: A list for W3C XML Developers.