What Clean Specs Achieve

Wed Feb 10 08:15:28 GMT 1999

roddey at us.ibm.com wrote:

> >>>
> >>>But is anyone here trying to _implement_ Java?  Lots of folks here are
> >>>indeed trying to _implement_ XML 1.0 (parsers and SAX), XLink and
> >>>XPointer, Namespaces, XSL, etc.  It's not like we're only trying to
> >>>_use_ them, as is the case with Java (or SQL, another example that's
> >>>been bounced around.)
> >>
> >>Most of them seem to be succeeding.  What should we conclude? -Tim
> >
> >Most people who don't succeed, don't announce.  We can't conclude
> >anything.
> >
> >Judging from the volume of questions (and controversy) on this and its
> >sibling lists (XSL-list, xlxp-dev), there's a lot of improvement that
> >could be made.
>
> As an above average developer, who just implemented the bulk of his first
> XML parser (C++) in a binge over the last month, I have to question whether
> any 'average' developer will ever implement a full featured parser. I found
> it very non-trivial to write an XML parser that was well decomposed and
> layered and pluggable, while retaining competitive performance. I found
> that XML itself was not very conducive to fast processing and reasonably
> simple architecture.

Very true.  Just getting something working that handled the data I was using took about two
weeks of my time: one week reading the spec and asking questions, and the other week writing
the code.  This was back in January of 1997, when there were no XML tutorials around (the spec
was not even a recommendation then).

> As to the spec... I don't mean to hurt anyone's feelings, but I found the
> spec during that effort to be as confusing as enlightening. It describes
> the logical (sometimes illogical :-) design of XML. But it doesn't help so
> much when it comes to trying to apply that to some physical design. Of
> course that's not their job, but obviously there have been a good number of
> parsers written and some obvious issues in implementation could be
> discussed, to save implementers from doing the same things over and over
> again and then having to fix them. Of course now it's all obvious :-) But I
> had to really struggle through it the first time. A 4 or 5 page prose
> document describing the most obvious implementation pitfalls (and
> possibly some obvious implementation strategies) could have saved me a week
> probably. Yes the spec is supposed to describe XML, but is its overall goal
> not to facilitate the development of software that implements it?

I doubt that is the goal, but many people are hesitant to disclose their parsing "secrets"
(-:.  I think of XML parsers as pure commodities that you cannot make a penny off of unless
you have some higher-level tools built on top of a good parser framework.  I have found that,
at least in Java, a lot of the things I learned while tuning performance helped me out in
areas of programming that have nothing to do with XML.  I think Mr. Clark likes to refer to
his generous works as reference implementations.  However, XP is not easy to learn from, as
it is very low-level and not very straightforward in terms of interfaces.  (I am not trying
to disrespect Mr. Clark here: the XML parser I wrote may be fast, but the code is practically
unmaintainable, because my extreme efforts at performance severely compromised the good
software engineering principles that I usually try to follow in my work.)  I think this can
be said of just about all of the XML parsers out there: they are all spaghetti, except
perhaps for Aelfred.

> And I suspect that perhaps there are probably parsers out there, where the
> developers really cannot intellectually prove that they do the right thing.
> I would be willing to bet that some of them just fix problems until it runs
> the James Clark tests and digest the Bosak files? When a customer reports a

That is what I did for a long time.  Debugging through the entire Clark test suite took a week
or more, and I still don't pass much more than 90% of the tests for not-well-formed
documents, but I suspect Mr. Clark spent a lot longer than a week writing the test suite (-:

> problem, and sends in a sample file, then they look at the spec and try to
> see if that file seems to correspond to the spec and fix their code to
> handle if so. That is far easier than trying to prove that every method in
> your code meets the spec (though it's obviously not the optimum thing to
> do.)

Yah, generally if you control how your data is created, you can whip up a decent parser to
meet your needs.  Also, if you don't check for a lot of the obscure errors that may pop up you
can save yourself a ton of time in processing overhead.  Unfortunately, in my case the XML
parser will be used in an end-user product where users may edit files manually (and screw
things up in the process).  But if you just want to have some basic XML capabilities for your
organization and don't want to deal with using other people's codebases, XML is not too much
of a beast (understanding the spec takes longer than writing the code at first).
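
To make the "control how your data is created" shortcut concrete, here is a minimal sketch in
Python.  It is not taken from any parser mentioned in this thread.  It assumes the input
contains only elements, attributes, and character data (no comments, CDATA sections,
processing instructions, or DOCTYPE, and no '>' inside attribute values), and it checks
nothing except tag nesting.  The names TAG and check_nesting are made up for illustration.

```python
import re

# Hypothetical pattern: start tags, end tags, and empty-element tags only.
# Assumes no comments, CDATA, PIs, or DOCTYPE, and no '>' in attribute values.
TAG = re.compile(r"<(/?)([A-Za-z_:][\w.:\-]*)([^<>]*?)(/?)>")


def check_nesting(text):
    """Return True if every start tag is closed, in the right order."""
    stack = []
    for match in TAG.finditer(text):
        closing, name, _attrs, empty = match.groups()
        if closing:
            # An end tag must match the most recently opened element.
            if not stack or stack.pop() != name:
                return False
        elif not empty:
            # A start tag (not an empty-element tag) opens a new element.
            stack.append(name)
    # Anything left on the stack was never closed.
    return not stack
```

Everything this sketch skips (character validity, attribute syntax, entity handling, a single
root element) is the kind of obscure checking that costs time once end users can edit the
files by hand.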

> Am I being too cynical here? Maybe so. But, I just don't think that an
> 'average' developer could write an XML processor that is complete,
> expandable, maintainable, and speedy, if all he/she had to work with was
> the raw XML spec (at least not in a time that would be acceptable in a
> commercial setting, which is what mostly counts I guess?) I think that it
> would more likely just be 'proven' to be correct through empirical testing,
> not through an ability to completely understand all the interactions
> expressed in the XML spec and implement them cleanly.

Very true.  I fell into this trap when people on this list were talking about how an average
university CS student could whip one up in a week.  At first I said "geez, this is easy", but
when I started caring about performance and about detecting some of the very obscure
errors in order to be 100% compliant with the draft, I found myself going insane doing a lot
more work with XML than I originally intended.  Then this XML stuff ballooned into a bunch of
XML-related work for several clients with these tools, and now I am here discussing XML with
everyone else, when all I intended at first was to have basic XML support in the core
application I was working on.

> Also, the interactions that just exist in XML (regardless of how well or
> badly they are expressed in the spec) mean that the skill level required
> to do something that is *maintainable and expandable* (i.e. well decomposed
> despite all the interactions) is that much higher still. Arguing whether or not
> someone could manage to read the spec and squeeze something out that (in
> whatever shape) was a fully compliant parser, isn't very meaningful to me.

I could not agree more.

> Oh well, that's my po' two cents worth. I think that yes you need a dry
> laying out of the facts *and* some guidance at a higher level, related as
> much to possible implementation issues as interpretation issues. I think
> that the current spec perhaps is somewhere in between the two and thus
> somewhat fails to fully please either master?

You can thank the many people here who have provided open-source parsers to work from (I was
never able to actually get mine out in open-source form as I originally intended, for various
business reasons).  I myself, though, decided to waste a lot of time coming up with an XML
architecture that works very differently from the event-based or tree-based parsers out
there, as it is more of a data-driven model than anything else.  (Oh, I forgot to mention
Lark from Tim Bray, which uses a DFA model that is unique among the current crop of XML
parsers.)  I would say Aelfred is the best "reference" implementation out there, if you
could call it that, and anyone who just wants to whip up a decent event-based XML parser
should take a look at its source, as it is pretty clean and straightforward.
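
For anyone who has not seen the event-based style, here is a rough sketch of the idea in
Python.  It is not Aelfred's design or API, only an illustration of a parser that reports
start tags, end tags, and character data to a handler as it scans.  It assumes the input
contains only elements and character data (no comments, CDATA sections, processing
instructions, DOCTYPE, or entity handling) and does no well-formedness checking.  TOKEN,
RecordingHandler, and parse are hypothetical names.

```python
import re

# Hypothetical tokenizer: either a tag (start, end, or empty-element) or a
# run of character data.  Attributes are skipped, not reported.
TOKEN = re.compile(r"<(/?)([A-Za-z_:][\w.:\-]*)[^<>]*?(/?)>|([^<]+)")


class RecordingHandler:
    """Illustrative handler that just records the events it receives."""

    def __init__(self):
        self.events = []

    def start_element(self, name):
        self.events.append(("start", name))

    def end_element(self, name):
        self.events.append(("end", name))

    def characters(self, text):
        self.events.append(("text", text))


def parse(text, handler):
    """Scan the text once and report each token to the handler."""
    for match in TOKEN.finditer(text):
        closing, name, empty, chars = match.groups()
        if chars is not None:
            handler.characters(chars)
        elif closing:
            handler.end_element(name)
        else:
            handler.start_element(name)
            if empty:
                # An empty-element tag is reported as a start plus an end.
                handler.end_element(name)


handler = RecordingHandler()
parse("<a>hi<b/></a>", handler)
```

The parser never builds a tree; the application decides what to keep by how it implements
the handler methods.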

Tyler

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)