XSL and the semantic web

Paul Prescod paul at prescod.net
Tue Jun 22 05:23:07 BST 1999

Marcelo Cantos wrote:
> Yes, gratuitously.  In the context of XSL, formatting can be done by a
> style sheet at the client end, hence there is no need to provide the
> final FO's.

Well, David has already pointed out that there are places where the full
XSL engine cannot run on the client. He has also pointed out that there
are business reasons for wanting to keep your semantic data private.

> None of the issues raised here applies to the discussion at hand,
> which is automated processing of content.  This is what the "semantic
> web" is supposed to be about isn't it?  Sure you can word index the
> HTML, but that won't tell you which word is the person's name.
> So let me be more explicit: FO's are, for the purposes of automated
> extraction of semantic content, little or no better than GIF's.

In another message in this thread you said that trying to hide information
from automated extractors (spam-bots) through data-dumbing was a lost
cause. It raised the bar "a half inch off of the ground." Now you're
saying that dumbed-down-text is as hard to process as a GIF. Which is it?

My personal opinion is that dumbed-down-text is not hard to process if you
know the dumbing-down algorithm in advance. But it is very hard to process
if you are trying to write a bot that will *predict* what a random site's
data dumbing algorithm will be like. Trolling the Web for "shoe prices" is
a lot harder when shoe prices are labelled as <P>'s.

My point of view is that making bot-creation harder is an information
owner's prerogative. Making bot-creation easier is also the information
owner's right. Charging extra money for the bot-friendly version is yet
another right.
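
To make the shoe-price point concrete, here is a minimal Python sketch.
The element names (catalog, shoe, price), the sample data, and the
"every second <P> is a price" convention are all invented for
illustration; they do not describe any real site.

```python
import xml.etree.ElementTree as ET

# Hypothetical semantic markup: the element names say what the data is.
SEMANTIC = (
    "<catalog><shoe><name>Loafer</name>"
    "<price currency='USD'>59.95</price></shoe></catalog>"
)

# The same data after "dumbing down": everything is just a paragraph.
DUMBED = "<body><p>Loafer</p><p>59.95</p></body>"


def prices_from_semantic(xml_text):
    """Find prices by element name; no site-specific knowledge needed."""
    root = ET.fromstring(xml_text)
    return [float(p.text) for p in root.iter("price")]


def prices_from_dumbed(xml_text):
    """Find prices in dumbed-down text.

    This works only because we assume we already know this one site's
    convention (every second <p> holds a price). A bot trolling random
    sites would have to guess that rule for each of them.
    """
    root = ET.fromstring(xml_text)
    paragraphs = [p.text for p in root.iter("p")]
    return [float(text) for text in paragraphs[1::2]]
```

The first function would carry over to any site that used the same
element names. The second has to be rewritten for every site, which is
exactly the cost the information owner is choosing to impose.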

> I would be surprised if the content provider went to all the effort of
> exposing semantic markup and then didn't bother to tell anyone what it
> meant.

If their goal is not to share then that is exactly what they would do.
That's my point: even if it were possible (which it isn't) to force people
to share semantically meaningful data, the fact that it is semantically
meaningful *to them* does not mean that it is meaningful *to you* without
sufficiently smart software! Forcing them (as if it were possible) to
distribute semantic data is only the start of the battle.

> > If Lexis-Nexis publishes its terabytes of data in a proprietary
> > document type, it might as well be Greek. HTML is more useful
> > because I can at least display it.
> This is patently false.  All you need is a stylesheet, which it would
> be Lexis-Nexis' responsibility to provide you with if they wanted to
> let you display it (if you are arguing for HTML then obviously they
> want you to be able to display it).

So what you are saying is that you need the information owner's help in
understanding the information. That's what I'm saying also. Just getting
it on the Web is not useful. Information owners need to *want* to build
the semantic web so that they can help us interpret their data.

> > Guessing at the structure of a document type from element type names
> > is as dangerous as guessing based on text content like colons and
> > font sizes. If you want the semantic web to be robust, you need
> > people to WANT to publish semantic data in *standardized document
> > types*.  Even if we could force them to publish in semantic but
> > non-standard document types we would be no farther ahead!
> >
> > Trees: XSL being used to destroy semantic information.
> >
> > Forest: The hard work of building robust information systems that
> > will even *allow* us to share semantics meaningfully.
> This argument amounts to a throwing of the hands up in the air and
> saying, "It's just too hard.  We shouldn't even try!"  

No it isn't. Please read what I wrote above. Where did I say that we
shouldn't try to build a semantic web? If anything, I said that we
shouldn't try to *force organizations* that for some reason do not want to
participate into doing so. Not only is it impossible and ill-conceived, it
is just plain wrong from an economic and moral point of view.

> Moreover, the example you give could easily be handled by separating
> the formatting parts of the transformation side into two stages, the
> non-formatting-related aspects at the server, and the formatting
> aspects at the client.  One might argue that this blurs the
> distinction I am trying to make, but you obviously had no trouble
> categorically asserting that footers are a formatting construct.  Is
> it ever really that difficult to discriminate between the two
> concepts?

Where do you insert boilerplate text? Is that formatting or
transformation? In CSS it is formatting (since CSS doesn't do
transformation) and in XSL it is transformation (since XSL formatting
objects don't have prefixes). 

Where do you label something as being a block or inline? In CSS it is
formatting. In XSL it is transformation.

Where do you re-order the figure and the figure's caption? In some style
languages (not CSS) this is possible without a transformation. In XSL it
is a transformation.

Where do you fetch the text from the other end of a cross-reference and
stick it in the current location? In some style languages that is just a
simple declaration. In others it is a transformation.

If the only purpose of any transformation is for human display I call it
formatting, no matter how sophisticated or complex it is. If you have some
better distinction between formatting and transformation I would love to
hear it.
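
As a rough illustration of why the line is hard to draw, here is a small
Python sketch of two of the operations above: re-ordering a figure and
its caption, and inserting boilerplate text. The input document, the
output element names (block, inline), and the "Figure 1: " label are all
hypothetical. Mechanically this is a tree transformation, yet its only
purpose is human display.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document: the caption comes before the image.
SOURCE = (
    "<figure><caption>Sales by region</caption>"
    "<image href='sales.png'/></figure>"
)


def present_figure(xml_text, label="Figure 1: "):
    """Build a display-oriented tree from a <figure> element.

    The image is moved ahead of the caption, and a boilerplate label is
    prepended to the caption text. The result is returned as an Element.
    """
    figure = ET.fromstring(xml_text)
    caption = figure.find("caption")
    image = figure.find("image")

    out = ET.Element("block")           # hypothetical block-level container
    out.append(image)                   # re-ordering: image now comes first
    cap = ET.SubElement(out, "inline")  # hypothetical inline container
    cap.text = label + caption.text     # boilerplate prefix added here
    return out
```

A style language with generated content and ordering properties could
express the same result as a few declarations, with no tree rewriting
visible to the stylesheet author.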

 Paul Prescod  - ISOGEN Consulting Engineer speaking for only himself

[Woody Allen on Hollywood in "Annie Hall"]
Annie: "It's so clean down here."
Woody: "That's because they don't throw their garbage away. They make 
        it into television shows."

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)
