why distinctions within XHTML?

Mark Birbeck Mark.Birbeck at iedigital.net
Wed Sep 1 13:49:51 BST 1999

David Brownell wrote:
> Mark Birbeck wrote:
> > 
> >  6. There are three variants of HTML 4.0 so we need three variants
> >     of 'HTML 4.0 as XML' (let's call it XHTML).
> Isn't that assertion pretty core to this debate?  That is, it's
> not a generally accepted assumption.

If you want to write something that transforms current HTML into XML,
you need to go via an XML version of HTML. Since there are three
variants of HTML 4.0, you need three 'XML versions of HTML'. To me that
says nothing about the future direction of XHTML, or what future
browsers will do, etc. It just says if you want to manipulate current
documents, you have to accept they are in one of three dialects of the
same language.

> This is the "commonality" argument -- we're striving for common
> vocabularies and reuse, broadening markets not restricting them,
> making software general purpose (while allowing specialization
> in those few cases it's needed).

That's good - and I'm sure 'modular' XHTML will address all those
issues. But what about dealing with current problems? We need a mark-up
for transforming current HTML documents to XML. And when I say this, I
don't mean converting:

<TD ALIGN=LEFT>The Thin Red Line <B>1998</td>

to:

<TD ALIGN="LEFT">The Thin Red Line <B>1998</B></TD>

That is converting it to an XML version of HTML, but there is no
meta-information. I mean converting it to:

    <Title>The Thin Red Line</Title>

To do this efficiently is a two-stage process. First 'tidy' the HTML
into something that is XML, but still 'looks like' HTML. If possible,
make the tidying resolve ambiguities the same way every time. Then, once
it is in tidyHTML, you can transform it into whatever you like. Of course you could
write your own messHTML to filmXML converter, and then I could write my
own messHTML to addressBookXML converter, and so on. But as you say, we
want "reuse" and "commonality".
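
As a rough illustration of the two stages, here is a minimal sketch in
Python using only the standard library. The names tidy() and
to_film_xml() are hypothetical, the repair rules are a deliberately
simple assumption (close anything left open, quote every attribute), and
element names come out in lower case because that is what Python's
html.parser reports. It is not how any real tidying tool works.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Elements that never have content; emitted as <br /> and so on.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class _Tidy(HTMLParser):
    """Stage 1: turn messy HTML into well-formed XML that still looks like HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.stack = []  # currently open elements, outermost first

    def handle_starttag(self, tag, attrs):
        # A valueless attribute such as NOWRAP becomes nowrap="nowrap".
        text = "".join(
            f" {name}={quoteattr(value if value is not None else name)}"
            for name, value in attrs
        )
        if tag in VOID:
            self.out.append(f"<{tag}{text} />")
        else:
            self.out.append(f"<{tag}{text}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: always ignored, so the result is repeatable
        # Close everything left open inside this element, innermost first.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))

    def close(self):
        super().close()
        while self.stack:  # close whatever is still open at end of input
            self.out.append(f"</{self.stack.pop()}>")


def tidy(html):
    parser = _Tidy()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def to_film_xml(tidy_xml):
    """Stage 2: an ordinary XML parser pulls the film title out of the cell."""
    cell = ET.fromstring(tidy_xml)
    title = ET.Element("Title")
    title.text = (cell.text or "").strip()
    return ET.tostring(title, encoding="unicode")


messy = "<TD ALIGN=LEFT>The Thin Red Line <B>1998</td>"
print(tidy(messy))
print(to_film_xml(tidy(messy)))
```

Only to_film_xml() knows anything about films. An address-book
converter could reuse tidy() unchanged and replace just the second
stage.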

We therefore need two things to be defined by those rogues at W3C:

- a tidyHTML
- an xmlHTML

The first plays the role I have explained above - allowing us to convert
legacy HTML into lovely, succulent, marked-up data (that thing that we
were all so excited about last week, remember?). The second plays the
role of defining a format for future browsers, such that HTML can be
validated. This is necessary because the HTML may appear inside other
mark-up languages, or may itself contain other mark-up languages. But we
need schemas for this. It *cannot* be done with DTDs, since it is very,
very difficult to have elements from different namespaces in the same
document. A stop-gap today is to embed the documents as you need, but
lose validation. If you do this, at least you have the namespace to help
you, if you are doing any post-parsing processing.
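
To illustrate the stop-gap, here is a small Python sketch of
post-parsing processing that uses the namespace to pick the HTML
elements out of a mixed document that has only been checked for
well-formedness, not validated. The film vocabulary and its
urn:example:film namespace are invented for the example, and the XHTML
namespace URI shown is an assumption: it is the one used by the later
XHTML 1.0 Recommendation, which may differ from the drafts under
discussion here.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI (the one in the XHTML 1.0 Recommendation).
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical film vocabulary with some HTML embedded in a review.
doc = """<film xmlns="urn:example:film" xmlns:h="http://www.w3.org/1999/xhtml">
  <name>The Thin Red Line</name>
  <review><h:p>Released in <h:b>1998</h:b>.</h:p></review>
</film>"""


def html_elements(xml_text):
    """Return the local names of all elements in the XHTML namespace."""
    root = ET.fromstring(xml_text)
    prefix = "{" + XHTML_NS + "}"  # ElementTree spells tags as {uri}local
    return [el.tag[len(prefix):] for el in root.iter() if el.tag.startswith(prefix)]


print(html_elements(doc))
```

Nothing here checks that a p may contain a b, or that a review may
contain a p. That is the validation you lose until schemas can describe
documents that mix namespaces.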

I think that XHTML completely solves the problem of tidyHTML, and it
*begins* to look at the issue of xmlHTML. However, the latter will not
be complete until modularization comes along.

So I think the whole 'how many lines of code must a man walk down' issue
is a red herring, because by the time we come to actually do the things
that people are suggesting will be a problem, they won't be.

Best regards,


xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)
