having to deal with mal-formed XML

Jonathan Eisenzopf eisen at pobox.com
Wed Feb 24 20:36:03 GMT 1999

Chris Weikart wrote:

> Well of course you're both right. And being right is important for the
> evolution of XML. But it's useless for me and what I need to do, here and
> now, in the short-term, in which I must operate.
> My time frame completely precludes berating the publishers of bad XML. So I
> am pursuing an imperfect, engineering solution to the problem. The current
> bag of fixes has a negligible impact on my performance, since I'm
> bottlenecked on other, far more expensive processes. And it covers 100% of
> the errors I've found so far, in 3K URLs, so I'd guess it'll cover over 90%
> of what I eventually find - in the ASX subset of XML. Therefore, as an
> engineering solution, it's a very good one.
> I should mention, btw, that I could find no Microsoft advertising to the
> effect that ASX V3 is XML. I looked at it and decided that they based it on
> XML. Most of it parses as XML. Ultimately my use of fixups and XML::Parser
> is far better than writing a specialised ASX parser because (a) it produces
> quite acceptable results for (b) far less effort.
> I could rant and rave about Microsoft (believe me, I do ;-), and tell you
> more about the engineering tradeoffs I'm balancing. But ultimately, I seem
> to have posted to the wrong list. Thanks for your responses, and sorry to
> have wasted your time!

This is an interesting and important thread. Most of us on this list know the
basic XML rules, and know that they are strict and must be enforced. On the
other hand, the Desperate Perl (or XML) Hacker will have to deal with malformed
or XML-like formats that they have no control over.

So what to do?
1. write a script to make it well-formed XML
2. write a non-XML parser
3. send the content author a nasty-gram telling him where he can stick his
cruddy-malformed-nonXML-to-impress-the-boss-format-shoulda-RTFM(past tense)

While most of us prefer option #3, doing so would probably not win us a
customer relations prize. My recommendation would be #1 if it's possible. #2
is ok if the format is simple and it would take less work than #1.
In summary, Chrisyoudidtherightthingdon'thateusbecausewe'reanalretentive
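
For what it's worth, option #1 can be sketched roughly like this. It's a minimal
sketch in Python rather than Perl, and the two fixups (bare ampersands and
unquoted attribute values) are my own guesses at typical errors, not the list
Chris actually uses. The names fix_up and parse_lenient are made up for the
example. The strict parse is tried first, so well-formed input is never touched.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fixups: the thread does not say which errors were actually found.
# An ampersand that does not start an entity or character reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")
# An attribute whose value is not wrapped in quotes, e.g. version=3.0
UNQUOTED_ATTR = re.compile(r"""(\s[A-Za-z_][\w.-]*)=([^\s"'<>]+)(?=[\s/>])""")


def fix_up(text):
    """Apply the bag of textual fixes to a document that failed to parse."""
    text = BARE_AMP.sub("&amp;", text)
    text = UNQUOTED_ATTR.sub(r'\1="\2"', text)
    return text


def parse_lenient(text):
    """Parse strictly first; only fall back to the fixups on failure."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        # Raises ParseError again if the fixups were not enough.
        return ET.fromstring(fix_up(text))


sample = '<ASX version=3.0><Entry><Ref href="http://example.com/a?x=1&y=2"/></Entry></ASX>'
root = parse_lenient(sample)
print(root.tag, root.get("version"), root.find("Entry/Ref").get("href"))
```

The regexes are deliberately blunt and can misfire on text that merely looks
like an attribute, which is why they run only after the real parser has
already rejected the input.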

BTW, I recently ran into this problem and took option #2. It's not optimal,
but like Chris, I had to get the job done. I wrote it up in an article
at: http://www.webreference.com/perl.

Looking back, I probably would have done it differently, but hey, it's
working. Fortunately, since I wrote the app that generates the crappy XML last
spring, and the client and article that parse the crappy XML last summer,
I will soon have the opportunity to fix the crappy XML and realign the rift in
the space-time continuum.


xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)
