Word and XML (was: XML standards coherency and so forth)

Tony McDonald tony.mcdonald at ncl.ac.uk
Thu Feb 11 09:38:13 GMT 1999


>> From: "Rick Jelliffe" <ricko at allette.com.au>
>> Date: Sun, 24 Jan 1999 16:15:36 +1100
>> Subject: Re: Word and XML (was: XML standards coherency and so forth)
>>
>> From: Biron,Paul V <Paul.V.Biron at kp.ORG>
>
> [snip]
> Wow!  I've been so busy lately that I haven't been able to keep up with
> XML-DEV and had no idea my "innocent" post on Word and HTML/XML had been so
> long lived!
>
> [snip]
>
> In truth, we've spent a great deal of time writing tools (a big daisy chain
> of FrontPage v1.1 -> hand-rolled perl script 1 -> hand-rolled perl script 2 ->
> etc.) just to HTML output from Word '97.  What has made this all the more
> frustrating for us is that the HTML is not really what we want in the end.
> We just want a "clean" HTML version so that the transformation to the XML
> DTD that we're interested in is "easier".  The BOLD and ITALIC that our
> authors see actually represent more "semantic" XML elements, e.g., <allergy>
> and <medication>.  Such is life.

I don't know how far down this route you've gone, Paul, but can I
suggest using rtf2xml (http://www.sesha.com/omlette/rtf2xml/)? It
uses the limited version of Omnimark (http://www.omnimark.com) as an
engine and does a very good job of RTF -> XML conversion.

It uses Word paragraph and character styles to convert the RTF into
well-formed and valid XML, e.g.:

<p stylename="List Bullet" 
color="1"><pntext>&#183;&tab;</pntext><string color="1">Almanack 
&amp; Administration Information </string><string charstyname="URL" 
fontsize="20" italic="on" 
color="1">http://nme.ncl.ac.uk/almanack/</string><string color="1"> 
</string></p>

(You can see that the additional formatting information that was in the
original Word document is carried over too.)
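
To make the role of the style attributes concrete, here is a minimal
sketch in Python (not Omnimark, which is what rtf2xml itself uses) that
reads the paragraph style and the character style of each run from the
sample above. The DOCTYPE declaring &tab; and the function name
styled_runs are assumptions added for illustration; they are not part of
rtf2xml.

```python
import xml.etree.ElementTree as ET

# Sample modelled on the rtf2xml output quoted above. The DOCTYPE that
# declares &tab; is an assumption, added so a generic parser accepts it.
SAMPLE = """<!DOCTYPE p [<!ENTITY tab "&#9;">]>
<p stylename="List Bullet" color="1"><pntext>&#183;&tab;</pntext><string color="1">Almanack &amp; Administration Information </string><string charstyname="URL" fontsize="20" italic="on" color="1">http://nme.ncl.ac.uk/almanack/</string><string color="1"> </string></p>"""


def styled_runs(xml_text):
    """Return (paragraph style, [(character style or None, text), ...])."""
    para = ET.fromstring(xml_text)
    runs = [(s.get("charstyname"), s.text or "") for s in para.findall("string")]
    return para.get("stylename"), runs


style, runs = styled_runs(SAMPLE)
print(style)
for char_style, text in runs:
    print(repr(char_style), repr(text))
```

Runs without a charstyname attribute come back with None, so a later
pass can treat them as plain text.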

I then pass this through another Omnimark program to get the following
(be aware that it's perfectly possible to create invalid and
badly-formed XML at this stage!):
...
<subsubsection>
<titleinfo class='subsubsection' level='3'>
<title class='subsubsection'>On-line Resources</title>
<sg_title>Organisation of Tissues</sg_title>
</titleinfo>
<subheading>Student Support and Tutoring (Computer Mediated 
Communication) Tools:</subheading>
...
<item><text>Almanack &amp; Administration Information
 </text><a xml:link='simple'
href='http://nme.ncl.ac.uk/almanack/'>http://nme.ncl.ac.uk/almanack/</a>
<text>  </text></item>
...
</subsubsection>
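
The second pass can be sketched roughly as follows, in Python rather
than Omnimark. This is only an illustration of mapping style names to
semantic elements, not the actual program: the function name upconvert,
the DOCTYPE declaring &tab;, and the rule that only "List Bullet"
paragraphs are handled are all assumptions.

```python
import xml.etree.ElementTree as ET

# Input modelled on rtf2xml-style output. The DOCTYPE declaring &tab;
# is an assumption, added so a generic parser accepts the fragment.
SAMPLE = """<!DOCTYPE p [<!ENTITY tab "&#9;">]>
<p stylename="List Bullet" color="1"><pntext>&#183;&tab;</pntext><string color="1">Almanack &amp; Administration Information </string><string charstyname="URL" fontsize="20" italic="on" color="1">http://nme.ncl.ac.uk/almanack/</string><string color="1"> </string></p>"""


def upconvert(xml_text):
    """Turn one styled 'List Bullet' paragraph into an <item> element."""
    para = ET.fromstring(xml_text)
    if para.get("stylename") != "List Bullet":
        raise ValueError("this sketch only handles 'List Bullet' paragraphs")
    item = ET.Element("item")
    # The <pntext> bullet glyph is dropped: <item> already says "list entry".
    for run in para.findall("string"):
        text = run.text or ""
        if run.get("charstyname") == "URL":
            # A run in the "URL" character style becomes a simple link.
            link = ET.SubElement(item, "a", {"xml:link": "simple", "href": text})
            link.text = text
        else:
            ET.SubElement(item, "text").text = text
    return ET.tostring(item, encoding="unicode")


result = upconvert(SAMPLE)
print(result)
```

Nothing here checks the result against a DTD, which is exactly how
invalid output can slip through at this stage.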

From this XML, the conversion to another HTML (or RTF etc.) format is
(relatively) easy.
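
As a rough illustration of that last step, the Python sketch below
turns one <item> into an HTML list entry. The mapping of <item> to <li>
and the function name item_to_html are assumptions made for the
example; the real down-conversion may well differ.

```python
import xml.etree.ElementTree as ET
from html import escape

# One <item> in the intermediate XML format shown earlier in this message.
ITEM = """<item><text>Almanack &amp; Administration Information </text><a xml:link='simple' href='http://nme.ncl.ac.uk/almanack/'>http://nme.ncl.ac.uk/almanack/</a><text>  </text></item>"""


def item_to_html(xml_text):
    """Render an <item> as an HTML <li> (assumed mapping, for illustration)."""
    item = ET.fromstring(xml_text)
    parts = []
    for child in item:
        text = escape(child.text or "")
        if child.tag == "a":
            href = escape(child.get("href", ""), quote=True)
            parts.append('<a href="%s">%s</a>' % (href, text))
        else:
            # <text> children contribute their (escaped) character data only.
            parts.append(text)
    # Trim the whitespace-only run left over from the Word paragraph.
    return "<li>%s</li>" % "".join(parts).strip()


html_out = item_to_html(ITEM)
print(html_out)
```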

I tried using the 'HTML' that Word 'emits' and had to have a lie 
down...this scheme of using RTF and well marked up original documents 
seems to be helping us along in our up-conversion process (whoever 
chose that term knew what they were talking about - it's like 
climbing, rather inching up, a vertical cliff face going backwards 
with no ropes...great fun)

hth
tone
------
Dr Tony McDonald,  FMCC, Networked Learning Environments Project
The Medical School, Newcastle University Tel: +44 191 222 5888
Fingerprint: 3450 876D FA41 B926 D3DD  F8C3 F2D0 C3B9 8B38 18A2




