Word and XML (was: XML standards coherency and so forth)
Paul.V.Biron at kp.ORG
Wed Feb 10 19:04:57 GMT 1999
> From: "Rick Jelliffe" <ricko at allette.com.au>
> Date: Sun, 24 Jan 1999 16:15:36 +1100
> Subject: Re: Word and XML (was: XML standards coherency and so forth)
> From: Biron,Paul V <Paul.V.Biron at kp.ORG>
> >Word 97 also produced several well-formedness violations when doing
> >more than simple nested lists.
> Dave Raggett's program "tidy" is excellent for fixing up badly formed
> HTML and making it valid (it figures out which HTML DTD the document is
> valid according to, and generates the appropriate DOCTYPE for it). It
> also is great for converting to HTML-in-XML (e.g. our website
> www.ascc.net/xml/ uses it).
> The program is available at
> I think website developers should consider making tidy a standard part
> of website maintenance. Each HTML editing program can do strange things
> to markup; using tidy on the maintenance fileset and then updating the
> website fileset is a good way to keep a WF site, without forcing you to
> give up non-WF tools.
Wow! I've been so busy lately that I haven't been able to keep up with
XML-DEV and had no idea my "innocent" post on Word and HTML/XML had been so
On this matter, tidy was one of the first "fix-it" approaches we tried.
Unfortunately, tidy doesn't happen to fix this particular problem. Tidy
does many, many VERY important things! Fixing this problem is not one of them.
The HTML produced by Word '97 from my example is:
<P>This is <B>a test <I>of the</B> emergency</I> broadcast system</P>
The output produced by tidy (22jan99 version) is:
<P>This is <B>a test <I>of the</I> emergency</B> broadcast system</P>
While this is "well-formed" HTML (it does not contain improper nesting), it
is NOT the output that is wanted. The problem is that in the original, the
BOLD stops after "the" (where it should stop); in the tidy version it
continues until after "emergency".
What Word should have produced in the first place is:
<P>This is <B>a test <I>of the</I></B> <I>emergency</I> broadcast system</P>
That is, the fix is to insert a </I> when the </B> is seen and then to
reopen <I> after the </B>. Tidy just replaces the </B> with </I> and then
replaces the original </I> with </B>.
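
The close-and-reopen fix described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not the
FrontPage logic or any of our perl scripts: the names fix_overlap, TAG and
INLINE are hypothetical, only <B> and <I> are treated as reopenable, and void
elements such as <BR>, comments, and ">" inside attribute values are not
handled.

```python
import re

# Matches an opening or closing tag; group 1 is "/" for a close, group 2 the name.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>")

# Assumption: only these inline elements are reopened after a forced close.
INLINE = {"B", "I"}


def fix_overlap(html):
    """Repair overlapping inline tags by closing and reopening them."""
    out = []
    stack = []  # (upper-cased name, original opening tag text)
    pos = 0
    for m in TAG.finditer(html):
        out.append(html[pos:m.start()])  # text before this tag
        pos = m.end()
        closing = m.group(1) == "/"
        name = m.group(2).upper()
        if not closing:
            stack.append((name, m.group(0)))
            out.append(m.group(0))
            continue
        names = [n for n, _ in stack]
        if name not in names:
            continue  # stray close tag with no matching open: drop it
        # Position of the most recently opened element with this name.
        idx = len(names) - 1 - names[::-1].index(name)
        reopen = stack[idx + 1:]  # elements opened after it, still open
        for n, _ in reversed(reopen):
            out.append("</%s>" % n)  # close them first, innermost first
        out.append(m.group(0))  # now the close tag itself is properly nested
        del stack[idx:]
        for n, tag in reopen:
            if n in INLINE:
                out.append(tag)  # reopen after the close, in original order
                stack.append((n, tag))
    out.append(html[pos:])
    return "".join(out)
```

Run on the Word '97 line above, fix_overlap returns
<P>This is <B>a test <I>of the</I></B><I> emergency</I> broadcast system</P>
which stops the BOLD after "the" as wanted. The one difference from the
hand-written version is that the space before "emergency" ends up inside the
reopened <I> rather than between </B> and <I>.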
The only tool I've found so far that fixes this problem correctly is
FrontPage v1.1 (about 4 years old, funny they had it working back then:-).
In truth, we've spent a great deal of time writing tools (a big daisy chain
of FrontPage v1.1 -> hand-rolled perl script 1 -> hand-rolled perl script 2 ->
etc.) just to clean up the HTML output from Word '97. What has made this all the more
frustrating for us is that the HTML is not really what we want in the end.
We just want a "clean" HTML version so that the transformation to the XML
DTD that we're interested in is "easier". The BOLD and ITALIC that our
authors see actually represent more "semantic" XML elements, e.g., <allergy>
and <medication>. Such is life.
Paul V. Biron
SGML Business Analyst
Kaiser Permanente, So Cal.
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)