2 questions about XML validating parsers.

Paul Tchistopolskii paul at qub.com
Sun Jun 6 06:15:31 BST 1999


1. entities preprocessor.

I found that for some XML parsers (like Expat, for example, but not only
Expat ;-) processing entities (especially PERefs) is not a kind of basic
functionality.

From my point of view, if parsing/validating XML is like compiling C, then
processing entities and resolving external references is the kind of thing
that the C preprocessor does. Given the shape of existing validating parsers,
for me it makes sense to write an XML preprocessor that will resolve all the
PERefs, 'include' all DTDs and produce an 'entity-free' XML stream (like the
C preprocessor does).

Even though this task looks trivial, there are some interesting twists there,
like efficiency, whitespace / formatting control, etc., so the question is:

Is there any tool that looks like an 'XML preprocessor'? In other words, is
there some tool I can use to:
    -    take complex XML documents containing complicated entities etc.
    -    produce an XML stream that would be acceptable to Expat ;-)
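
To make the idea concrete, here is a minimal sketch of one narrow piece of such a preprocessor: expanding internal parameter entities in a piece of DTD text. It is a toy under stated assumptions, not an existing tool. The function name expand_parameter_entities and the max_passes parameter are made up for illustration, and it ignores external entities, character references and the whitespace padding rules of the XML spec.

```python
import re

# Internal parameter-entity declarations, e.g. <!ENTITY % inline "B | I">
PE_DECL = re.compile(r'<!ENTITY\s+%\s+([\w.-]+)\s+(["\'])(.*?)\2\s*>', re.S)
# Parameter-entity references, e.g. %inline;
PE_REF = re.compile(r'%([\w.-]+);')


def expand_parameter_entities(dtd_text, max_passes=10):
    """Return dtd_text with internal parameter entities expanded.

    Hypothetical helper: only literal (internal) parameter entities are
    handled, and their declarations are dropped from the output.
    """
    entities = {}
    for name, _quote, value in PE_DECL.findall(dtd_text):
        # In XML the first declaration of an entity is the binding one.
        entities.setdefault(name, value)

    text = PE_DECL.sub('', dtd_text)
    # Repeat so that entities referring to other entities get resolved;
    # max_passes guards against self-referencing declarations.
    for _ in range(max_passes):
        expanded = PE_REF.sub(
            lambda m: entities.get(m.group(1), m.group(0)), text)
        if expanded == text:
            break
        text = expanded
    return text


dtd = '<!ENTITY % inline "B | I">\n<!ELEMENT P (#PCDATA | %inline;)*>'
print(expand_parameter_entities(dtd).strip())
# prints: <!ELEMENT P (#PCDATA | B | I)*>
```

A real preprocessor would also have to fetch external DTDs and entities and decide what to do with whitespace, which is where the 'interesting twists' come in.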

2.  The problem of 'sharing-very-close-but-not-exactly-the-same' XML

If I have


I still need to declare <!ELEMENT B somewhere, to make



Is there any  *practical* reason for such a restriction? I mean that
when somebody creates some DTD with element A of type ANY it
means that the structure of the body of element A is absolutely
unpredictable (otherwise, why declare it ANY?). Or is there
some other practical reason to declare some element to be ANY?

So on one hand we have 'the body of this element is unpredictable';
on the other hand we are placing some extra restriction that is stronger
than 'just be well-formed'.

Unfortunately, I don't understand what is good about that extra restriction
in practice, not in theory. The theory seems to be: "there should
be no undeclared elements in any part of a valid XML document".

(BTW why not? What practical problem may it cause to have undeclared
elements in that ANY part (and only there...)?)

What I do understand is the problem that such a restriction introduces
in practice.

A couple of weeks ago I asked how can I solve some
practical problem that I have - when 2 different companies
are exchanging the 'very-close-but-not-the-same' documents.


>  The only difference between A-documents and
>  B-documents is that A-documents have
>  <A1> element that is specific to company A
>  and B-documents have <B1> element specific
>  to company B - the rest of elements and
>  attributes are the same for both companies.
>  Now our companies decided to exchange
>  their documents. As a solution they may write
>  XSL stylesheets, or ( maybe) use entities
>  to remove A1 elements and B1 elements.
>  I see no other ways to work around this
>  situation and both are a bit complicated.
>  The problem here is to hide some part of
>  XML document from the validation process.

You could write a single DTD with the following:


and then have each company define their company-specific elements in
company-specific DTDs, which would be included through the internal subset.
This is open to abuse, though, as a valid document could contain any
elements, not just the intended, company-specific elements, in the
COMPANY_SPECIFIC element. However, if the documents are machine generated,
this is not a problem.  (By the way, EMPTY and ANY(THING) are valid only in
content model definitions, not attributes.)


So when one company introduces some new tag A1, the appropriate
<!ELEMENT declaration (at least <!ELEMENT A1 ANY> ;-) should
be propagated to the DTDs that are used by the other company. Now
what if we have 3 companies? 10? More? (And we *will* have such
problems with XSA and XSA-like networks.)

All those problems would simply disappear if we just allowed ANY to be
*really* ANY (meaning that validating parsers should allow undeclared
elements to appear in the body of an element of type ANY).
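
To show the difference between the two rules, here is a small sketch that only checks whether elements are declared. It does not check content models, so it is not a validator. The helper names and the DOC / COMMON elements are invented for the example; COMPANY_SPECIFIC and A1 are taken from the discussion above. The lenient_any switch models the relaxation I am asking about, not what XML 1.0 validating parsers do.

```python
import re
import xml.etree.ElementTree as ET

# Element type declarations, e.g. <!ELEMENT COMPANY_SPECIFIC ANY>
ELEMENT_DECL = re.compile(r'<!ELEMENT\s+([\w.-]+)\s+([^>]+)>')


def declared_elements(dtd_text):
    """Map each declared element name to its content spec."""
    return {name: spec.strip()
            for name, spec in ELEMENT_DECL.findall(dtd_text)}


def undeclared_elements(root, decls, lenient_any=False):
    """List the tags of undeclared elements in document order.

    With lenient_any=True, everything inside an element declared ANY
    is skipped (hypothetical relaxed rule).
    """
    missing = []

    def visit(elem):
        if elem.tag not in decls:
            missing.append(elem.tag)
        if lenient_any and decls.get(elem.tag) == 'ANY':
            return  # do not look inside an ANY element
        for child in elem:
            visit(child)

    visit(root)
    return missing


dtd = ('<!ELEMENT DOC (COMMON, COMPANY_SPECIFIC)>\n'
       '<!ELEMENT COMMON (#PCDATA)>\n'
       '<!ELEMENT COMPANY_SPECIFIC ANY>')
doc = ET.fromstring('<DOC><COMMON>x</COMMON>'
                    '<COMPANY_SPECIFIC><A1/></COMPANY_SPECIFIC></DOC>')
decls = declared_elements(dtd)

print(undeclared_elements(doc, decls))                    # ['A1']
print(undeclared_elements(doc, decls, lenient_any=True))  # []
```

Under the current rule company B has to learn about A1 before it can validate company A's document; under the relaxed rule the shared DTD alone would be enough.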

So the question is: what is the practical reason for not making ANY
really ANY?

(Yes, we can always write yet another XSL stylesheet for rendering
those 'very-close-documents' to each other, and because the documents
are close, those stylesheets would be easy to write. What I don't
understand is why we should write some extra code if just allowing
ANY to be ANY would allow us *not* to write extra code at all...)

Maybe I'm missing something again - I'll appreciate any feedback.


paul at pault.com

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1