Simple approaches to XML implementation

F. Chahuneau - General Manager fcha at Berger-Levrault.fr
Sat Mar 1 18:20:36 GMT 1997


[from PMR]

> 
> ESIS doesn't retain everything from the original document(s) and I've
> been asking the experts what gets lost.  

In case someone wants to get even more precise information, ESIS (Element
Structure Information Set) is fully defined in annex G of document
ISO/IEC/JTC1/SC18/WG8/N1035: Recommendations for a Possible Revision of ISO
8879 (SGML). You can find an exact reproduction of this passage in Charles
Goldfarb's "SGML Handbook" (Clarendon Press, 1990), pp 588 to 591.

> My rough summary is that
> XML->ESIS loses:
> - comments (this matters if you want to edit the document or have
>       it read by humans.  However comments should not be used
>       by machines - simply passed through)

True

> - entities.  If your document includes entities such as &chapter1;
>       these may be expanded and replaced by their contents.  In
>       this way some of the structure may be less clear

It's actually more complex than that. 

SGML *text* entity references, whether entities are "internal" or
"external", are indeed fully expanded and you are not even notified of this
in the ESIS event stream. Therefore, ESIS does not convey the "entity
structure" of an SGML document. This is, by the way, irrelevant to most
applications ... except for those, such as some SGML editors, whose purpose
is seen as being able to manipulate SGML documents without arbitrarily
altering their entity structure (in addition to their element structure).

External data entity references, internal SDATA and PI entity references
are signaled in the ESIS, while CDATA internal entity references are
expanded without being reported. This may appear to be a bizarre design
choice, but there is something even more disturbing: in the case of
internal SDATA entity references, only the entity "replacement value" is
passed, not the entity "name". This is one of the reasons why ESIS
information, alone, does not allow you to implement an "identity
transformation" for SGML documents, even when you don't care about the
physical decomposition of the document into several files (SGML entities).
Note that SDATA entities disappear in XML, so that THIS PROBLEM DISAPPEARS AS
WELL!
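
The silent expansion of text entity references can be illustrated in XML
terms with a small sketch. It assumes Python's standard expat binding as a
stand-in for an ESIS-like event stream (it is not an ESIS tool), and the
document, the entity name "chapter1" and the helper event_stream() are
made up for the illustration.

```python
from xml.parsers import expat

# Hypothetical sample document; the entity name "chapter1" is illustrative.
DOC = """<!DOCTYPE book [
  <!ENTITY chapter1 "Text of the first chapter.">
]>
<book><chapter>&chapter1;</chapter></book>"""


def event_stream(xml_text):
    """Collect (kind, value) events, roughly as an event-stream consumer sees them."""
    events = []
    parser = expat.ParserCreate()
    parser.buffer_text = True  # merge adjacent character data where possible
    parser.StartElementHandler = lambda name, attrs: events.append(("start", name))
    parser.EndElementHandler = lambda name: events.append(("end", name))
    parser.CharacterDataHandler = lambda data: events.append(("data", data))
    parser.Parse(xml_text, True)
    return events


events = event_stream(DOC)
# Only the replacement text arrives; nothing in the events names the entity.
text = "".join(value for kind, value in events if kind == "data")
```

With these handlers the consumer receives the replacement text as ordinary
character data, so it cannot write "&chapter1;" back out.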

> - conditional markup.  If you use INCLUDE and/or IGNORE then the
>       IGNORE'd sections won't come through and the INCLUDE'd 
>       ones won't be marked as such

True

>  [I think that processing instructions come through OK?  

True

>  And that you can determine whether an attribute value was defaulted
>  or not?]

Unfortunately not. This information is unavailable in ESIS, and you would
need to access some "DTD information set" to be able to recover it. Besides
attribute names and de facto values, the only side information you have in
ESIS is when the value for an #IMPLIED attribute has not been specified.
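
A rough way to see the same effect with an XML parser: the sketch below
again assumes Python's expat binding with only a start-tag handler, which
is an analogy and not ESIS itself. The element name "memo", the attribute
"status" and the helper start_events() are hypothetical.

```python
from xml.parsers import expat

# Hypothetical DTD fragment: "memo" and "status" are illustrative names.
DTD = '<!DOCTYPE memo [ <!ATTLIST memo status CDATA "draft"> ]>'


def start_events(xml_text):
    """Return (element name, attributes) pairs as the start-tag events report them."""
    seen = []
    parser = expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: seen.append((name, dict(attrs)))
    parser.Parse(xml_text, True)
    return seen


defaulted = start_events(DTD + "<memo/>")                 # value supplied by the DTD
specified = start_events(DTD + '<memo status="draft"/>')  # value typed in the start tag
```

Both calls produce the same event, so a consumer looking only at this
stream cannot tell the defaulted value from the specified one.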

There is one more piece of information missing in ESIS, which causes a
problem when implementing an "identity transformation" for plain SGML
documents: you don't know WHICH ELEMENTS HAVE BEEN DECLARED EMPTY in the
DTD. You may know when an element has null content, but you don't know
whether this is because it happens to be so (optional content) or because
it can't have any (declared EMPTY). Therefore, you do not know whether you
should output an end tag for it or not. Again, you would need some "DTD
information" to disambiguate. Maybe not everyone has realized it yet, but
this *is* the one and only reason why XML introduces this explicit <EMPTY/>
syntax for empty elements. This, again, makes this problem disappear with
XML.
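
As a minimal sketch of what this buys you, here is a re-serializer driven
only by start, end and data events, with no DTD access. It assumes Python's
expat binding as the event source; the function name identity() is
hypothetical, and the sketch ignores comments, processing instructions and
the document type declaration.

```python
from xml.parsers import expat
from xml.sax.saxutils import escape, quoteattr


def identity(xml_text):
    """Re-serialize a document from start/end/data events alone (no DTD)."""
    out = []
    pending = None  # start tag collected so far, not yet closed with ">" or "/>"

    def flush():
        nonlocal pending
        if pending is not None:
            out.append(pending + ">")
            pending = None

    def start(name, attrs):
        nonlocal pending
        flush()
        pending = "<" + name + "".join(
            " %s=%s" % (key, quoteattr(value)) for key, value in attrs.items())

    def end(name):
        nonlocal pending
        if pending is not None:
            # No content since the start tag: the empty-element form is always legal.
            out.append(pending + "/>")
            pending = None
        else:
            out.append("</%s>" % name)

    def data(text):
        flush()
        out.append(escape(text))

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = data
    parser.Parse(xml_text, True)
    return "".join(out)


result = identity('<doc><br/><p>a &amp; b</p><q></q></doc>')
```

The serializer never has to ask whether an element was declared EMPTY:
writing the empty-element form for any element without content is
well-formed XML, whereas the SGML equivalent of that decision needs the DTD.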

All in all, you can see that some design decisions in XML were precisely
motivated by the desire to make an ESIS event stream sufficient to
implement an identity transformation, even with no access to DTD
information. This is, of course, totally consistent with the idea that DTDs
should not be systematically needed for processing XML fragments.

Whether you work with an event stream or an abstract tree(*) is orthogonal
to this discussion: we are discussing the *available* information, not the
way it is represented. This does not mean that I see abstract trees as
useless, quite the contrary (see my previous mail).
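
To make the point concrete, the following sketch builds a tree of typed
nodes with attributes from nothing but the events, so the tree holds exactly
the information the stream carried. Python's expat binding is assumed as the
event source, and the Node class, build_tree() and the sample document are
hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass, field
from xml.parsers import expat


@dataclass
class Node:
    """One typed node: element type, attributes, and ordered children."""
    type: str
    attributes: dict
    children: list = field(default_factory=list)  # Node objects or str


def build_tree(xml_text):
    """Build an abstract tree using only start, end and data events."""
    root = None
    stack = []

    def start(name, attrs):
        nonlocal root
        node = Node(name, dict(attrs))
        if stack:
            stack[-1].children.append(node)
        else:
            root = node
        stack.append(node)

    def end(name):
        stack.pop()

    def data(text):
        stack[-1].children.append(text)

    parser = expat.ParserCreate()
    parser.buffer_text = True  # deliver adjacent character data as one chunk
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = data
    parser.Parse(xml_text, True)
    return root


tree = build_tree('<sec id="s1"><title>ESIS</title><p>Text</p></sec>')
# Cut and paste: move the <p> element in front of the <title> element.
paragraph = tree.children.pop(1)
tree.children.insert(0, paragraph)
```

Each element is a single node here, with no separate nodes for start tags
and end tags, which is what makes the cut-and-paste step a simple list
operation.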

I hope I helped clarify what ESIS was.

(*): I use the term "abstract tree" instead of "parse tree" to designate
the "tree of typed nodes with attributes" (you could also say "SGML object
tree", but this term seems to be somewhat overloaded these days...). From an
SGML parser's point of view, an SGML "parse tree" would have distinct nodes
for start tags and end tags, which are not what you are looking for when you
want a useful representation that allows you to cut and paste SGML elements
(seen as atomic, typed text objects with attached properties).


 
        
_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/
       _/ François CHAHUNEAU                 phone: [+33] 1 40 64 43 00  _/
      _/ Directeur Général/General Manager                              _/
     _/ AIS S.A.                             FAX: [+33] 1 40 64 43 10  _/
    _/ 15-17 rue Rémy Dumoncel    email: fcha at ais.berger-levrault.fr  _/
   _/ 75014, Paris, FRANCE        WWW: http://www.berger-levrault.fr _/
  _/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)



