XSL and the semantic web

Tue Jun 22 03:48:58 BST 1999

On Mon, Jun 21, 1999 at 12:19:36AM -0400, Paul Prescod wrote:
> Marcelo Cantos wrote:
> > 
> > Then the following two transformations:
> > 
> >   <employee status="active">
> >     <name>Joe</name>
> >     <phone>555-12345</phone>
> >   </employee>
> > 
> >   <H3>Joe</H3>
> >   <P>Phone: 555-12345</P>
> > 
> > Are of a fundamentally different character.  It is not a simple case
> > of having more or less information.  In the second example, even the
> > structure of the information you are entitled to (and this from the
> > owner's viewpoint) has been lost, and gratuitously so.
> 
> Gratuitously in what sense? You need to format the thing, right?
> Therefore you need to map to formatting constructs.

Yes, gratuitously.  In the context of XSL, formatting can be done by a
style sheet at the client end, hence there is no need to provide the
final FO's.
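
To make that concrete, here is a rough Python stand-in for what a
client-side style sheet does.  It is not XSL: the STYLE table and the
render function are names made up for illustration, and the input is
just the employee record quoted above.  The point is only that the
client can produce the <H3>/<P> form itself, so the server has no need
to.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical "style sheet": semantic element name -> output template.
STYLE = {
    "name": "<H3>{text}</H3>",
    "phone": "<P>Phone: {text}</P>",
}


def render(xml_text, style=STYLE):
    """Format a semantic record at the client.

    Child elements with no rule in the style table are skipped.
    """
    root = ET.fromstring(xml_text)
    lines = []
    for child in root:
        template = style.get(child.tag)
        if template is not None:
            lines.append(template.format(text=escape(child.text or "")))
    return "\n".join(lines)


doc = (
    '<employee status="active">'
    "<name>Joe</name><phone>555-12345</phone>"
    "</employee>"
)
print(render(doc))
```

The semantic record is still sitting there, untouched, for any tool
that wants more than a rendering.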

> > FO's leave you with what might as well be a GIF rendition of the
> > information you are after 
> 
> That's a serious exaggeration. Can you text-index a GIF? Can you do a
> "find word" in a GIF? Can you convert a GIF to RTF, load it into word for
> Windows and start typing?

None of the issues raised here applies to the discussion at hand,
which is automated processing of content.  This is what the "semantic
web" is supposed to be about, isn't it?  Sure, you can word-index the
HTML, but that won't tell you which word is the person's name.

So let me be more explicit: FO's are, for the purposes of automated
extraction of semantic content, little or no better than GIF's.
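
A small sketch of the difference, in Python.  The wrapper <div> around
the formatted version is an addition here so that it parses as a
single document, and both function names are hypothetical.  With the
semantic form the fields are read straight off the element names; with
the formatted form the program has to guess that a heading is a name
and that "Phone:" labels a phone number, and the status is simply
gone.

```python
import re
import xml.etree.ElementTree as ET

SEMANTIC = """<employee status="active">
  <name>Joe</name>
  <phone>555-12345</phone>
</employee>"""

# Assumption: wrapped in a <div> only so the fragment has a single root.
FORMATTED = """<div>
  <H3>Joe</H3>
  <P>Phone: 555-12345</P>
</div>"""


def extract_semantic(xml_text):
    """Read the fields directly from the element and attribute names."""
    emp = ET.fromstring(xml_text)
    return {
        "name": emp.findtext("name"),
        "phone": emp.findtext("phone"),
        "status": emp.get("status"),
    }


def extract_formatted(html_text):
    """Guess at the fields from headings and label text."""
    root = ET.fromstring(html_text)
    record = {"name": None, "phone": None, "status": None}
    heading = root.find("H3")
    if heading is not None:
        record["name"] = heading.text  # a guess: the heading is a name
    for para in root.iter("P"):
        match = re.match(r"Phone:\s*(\S+)", para.text or "")
        if match:
            record["phone"] = match.group(1)  # a guess based on the label
    return record  # status stays None: the formatted form never carried it


print(extract_semantic(SEMANTIC))
print(extract_formatted(FORMATTED))
```

The second function works on this one page only for as long as the
provider keeps the same heading level and the same label text.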

> David is completely right that these things live on a spectrum. GIFs
> are far, far down the end of the spectrum beyond FOs.
> 
> > A qualified FO:
> > 
> >   <DIV CLASS="employee" status="active"> <H3 CLASS="name">Joe</H3>
> >   <P>Phone: <SPAN CLASS="phone">555-12345</SPAN></P> </DIV>
> > 
> > would certainly go some way towards easing the strain though I
> > don't know if typical FO models (XSL in particular) allow this
> > much flexibility.
> 
> I think that there is an important but subtle point that keeps
> getting lost. The term "employee" is absolutely useless unless I
> know about it *in advance*. Unless I am expecting to get
> thousands of documents about "employees" I can't set up the
> stylesheets, queries, etc. to make this information useful.

You shouldn't have to.  Content providers should be quite capable of
providing both the semantic content and machine-readable instructions
for presenting it in a browser.  This satisfies both the
casual peruser with a generic client and the information seeker with
custom-built tools.

I would be surprised if the content provider went to all the effort of
exposing semantic markup and then didn't bother to tell anyone what it
meant.

> An "H3" is more useful to a browser than an "EMPLOYEE" because the
> former is *known in advance*. In all of this hand waving about the
> semantic web, people seem to think that once you put the semantics
> out everything just falls into place. Getting the semantics out is
> the EASY PART.  Rationalizing them is the hard part.

Of course it's hard, and maybe people do expect too much.  But
neither of these points constitutes an argument against doing it.
They are nothing more than a warning to be realistic in our
expectations.

In any case, none of this is going to prevent plain old generic access
through a browser-with-stylesheet.

> If Lexis-Nexis publishes its terabytes of data in a proprietary
> document type, it might as well be Greek. HTML is more useful
> because I can at least display it.

This is patently false.  All you need is a stylesheet, which it would
be Lexis-Nexis' responsibility to provide you with if they wanted to
let you display it (if you are arguing for HTML then obviously they
want you to be able to display it).

> Guessing at the structure of a document type from element type names
> is as dangerous as guessing based on text content like colons and
> font sizes. If you want the semantic web to be robust, you need
> people to WANT to publish semantic data in *standardized document
> types*.  Even if we could force them to publish in semantic but
> non-standard document types we would be no farther ahead!
> 
> Trees: XSL being used to destroy semantic information.
> 
> Forest: The hard work of building robust information systems that
> will even *allow* us to share semantics meaningfully.

This argument amounts to throwing one's hands up in the air and
saying, "It's just too hard.  We shouldn't even try!"  I frankly can't
see what you would lose.  Even in a worst case scenario where everyone
decided to ignore everyone else and began using proprietary doctypes,
you could still point your browser at Lexis-Nexis and display their
documents with their stylesheets, which is no worse than we have now
(and at least the semantics is there for those who know the
structure).  In the real world, people will get together, talk about
it, decide on conventions, argue about whose convention is best and
generally get on with it.  There need be no guessing games involved
(as there must inevitably be with HTML).

It is much safer to say, "This is how to do it." than to say, "Don't
do it!"  People, seeing the enormous potential of the "semantic web",
will simply ignore the latter advice and do it in any old way and they
will stuff it up.  You cannot prohibit or ban the use of such a
powerful concept no matter how dangerous or difficult it is.  It is
better to jump on top of it and tame it.

> > I have to take issue, however, with the characterisation of the
> > transformations as points on a spectrum.  There is a very well
> > defined distinction between transformation and formatting within
> > the XSL model, hence the move to split it into two separate
> > standards.
> 
> Actually, the two processes in XSL would be better termed
> "transformation" and "layout." Both steps do *formatting*. Choosing
> which text becomes the footer text is certainly formatting but it is
> done by the transformation part of the language.

And I take you back to my original example: there is no continuum
between <name> and <H3> or <employee> and <DIV>.

Moreover, the example you give could easily be handled by splitting
the transformation side into two stages: the non-formatting aspects
at the server, and the formatting aspects at the client.  One might
argue that this blurs the distinction I am trying to make, but you
obviously had no trouble categorically asserting that footers are a
formatting construct.  Is it ever really that difficult to
discriminate between the two concepts?
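
The split can be sketched in a few lines of Python.  Everything here
is assumed for illustration: the <staff> wrapper, the second employee,
the choice of "keep only active employees" as the non-formatting step,
and the two function names.  The first stage hands on semantic markup;
only the second stage knows anything about <H3> and <P>.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical source document held by the server.
SOURCE = (
    "<staff>"
    '<employee status="active"><name>Joe</name><phone>555-12345</phone></employee>'
    '<employee status="retired"><name>Ann</name><phone>555-67890</phone></employee>'
    "</staff>"
)


def server_stage(xml_text):
    """Non-formatting transformation: select content, keep it semantic."""
    staff = ET.fromstring(xml_text)
    out = ET.Element("staff")
    for emp in staff.findall("employee"):
        if emp.get("status") == "active":
            out.append(emp)
    return ET.tostring(out, encoding="unicode")


def client_stage(xml_text):
    """Formatting: map semantic elements onto presentation elements."""
    staff = ET.fromstring(xml_text)
    lines = []
    for emp in staff.findall("employee"):
        lines.append("<H3>%s</H3>" % escape(emp.findtext("name") or ""))
        lines.append("<P>Phone: %s</P>" % escape(emp.findtext("phone") or ""))
    return "\n".join(lines)


shipped = server_stage(SOURCE)   # what goes over the wire
print(client_stage(shipped))     # what the browser shows
```

Anyone with custom-built tools can stop after the first stage and work
on what was shipped.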

Cheers,
Marcelo

-- 
http://www.simdb.com/~marcelo/
