Word and XML (was: XML standards coherency and so forth)
Tony McDonald
tony.mcdonald at ncl.ac.uk
Thu Feb 11 09:38:13 GMT 1999
>> From: "Rick Jelliffe" <ricko at allette.com.au>
>> Date: Sun, 24 Jan 1999 16:15:36 +1100
>> Subject: Re: Word and XML (was: XML standards coherency and so forth)
>>
>> From: Biron,Paul V <Paul.V.Biron at kp.ORG>
>
> [snip]
> Wow! I've been so busy lately that I haven't been able to keep up with
> XML-DEV and had no idea my "innocent" post on Word and HTML/XML had been so
> long lived!
>
> [snip]
>
> In truth, we've spent a great deal of time writing tools (a big daisy chain
> of FrontPage v1.1 -> hand-rolled perl script 1 -> hand-rolled perl script 2 ->
> etc.) just to clean up the HTML output from Word '97. What has made this all
> the more frustrating for us is that the HTML is not really what we want in the end.
> We just want a "clean" HTML version so that the transformation to the XML
> DTD that we're interested in is "easier". The BOLD and ITALIC that our
> authors see actually represent more "semantic" XML elements, e.g., <allergy>
> and <medication>. Such is life.
I don't know how far down this route you've gone, Paul, but can I
suggest using rtf2xml (http://www.sesha.com/omlette/rtf2xml/)? It
uses the limited version of OmniMark (http://www.omnimark.com) as an
engine and does a very good job of RTF -> XML conversion.
It uses Word paragraph and character styles to convert the RTF into
well-formed and valid XML, e.g.:
<p stylename="List Bullet"
color="1"><pntext>·&tab;</pntext><string color="1">Almanack
&amp; Administration Information </string><string charstyname="URL"
fontsize="20" italic="on"
color="1">http://nme.ncl.ac.uk/almanack/</string><string color="1">
</string></p>
(you can see that the additional formatting information from the
original Word document is carried through too).
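Before writing the second-pass conversion, it can help to list which Word styles actually occur in the rtf2xml output. The following is only a rough sketch in Python (not OmniMark, which the post itself uses); the `styles_used` helper is hypothetical, and the sample is a trimmed copy of the fragment above with the `<pntext>` bullet marker left out so that no entity declarations are needed.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the rtf2xml-style fragment quoted above (assumption:
# the <pntext> bullet marker is dropped to avoid declaring &tab;).
SAMPLE = """<p stylename="List Bullet" color="1"><string color="1">Almanack
&amp; Administration Information </string><string charstyname="URL"
fontsize="20" italic="on"
color="1">http://nme.ncl.ac.uk/almanack/</string><string color="1">
</string></p>"""


def styles_used(xml_text):
    """Return (paragraph styles, character styles) found in the fragment."""
    root = ET.fromstring(xml_text)
    # iter() also visits the root element itself, so a lone <p> is counted.
    para_styles = {p.get("stylename") for p in root.iter("p")
                   if p.get("stylename")}
    char_styles = {s.get("charstyname") for s in root.iter("string")
                   if s.get("charstyname")}
    return para_styles, char_styles


para, char = styles_used(SAMPLE)
print(sorted(para), sorted(char))
```

Each style name reported this way is a candidate for a rule in the next conversion step.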
I then pass this through another OmniMark program (be aware that it's
perfectly possible to create invalid and badly-formed XML at this
stage!) to get:
...
<subsubsection>
<titleinfo class='subsubsection' level='3'>
<title class='subsubsection'>On-line Resources</title>
<sg_title>Organisation of Tissues</sg_title>
</titleinfo>
<subheading>Student Support and Tutoring (Computer Mediated
Communication) Tools:</subheading>
...
<item><text>Almanack &amp; Administration Information
</text><a xml:link='simple'
href='http://nme.ncl.ac.uk/almanack/'>http://nme.ncl.ac.uk/almanack/</a><text> </text></item>
...
</subsubsection>
From this XML, the conversion to another HTML (or RTF etc.) format is
(relatively) easy.
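As an illustration of the style-driven step described above, here is a minimal sketch in Python rather than OmniMark. It is not the actual program used here; the `bullet_to_item` function and its mapping rule (character style "URL" becomes a link, everything else becomes `<text>`) are assumptions read off the two samples. Building a tree and serialising it, instead of pasting strings together, is one way to avoid emitting badly-formed XML such as an unescaped ampersand.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the rtf2xml-style input (assumption: <pntext> omitted).
SAMPLE = """<p stylename="List Bullet" color="1"><string color="1">Almanack
&amp; Administration Information </string><string charstyname="URL"
fontsize="20" italic="on"
color="1">http://nme.ncl.ac.uk/almanack/</string><string color="1">
</string></p>"""


def bullet_to_item(p):
    """Turn one <p stylename="List Bullet"> element into an <item> element."""
    item = ET.Element("item")
    for s in p.iter("string"):
        content = s.text or ""
        if s.get("charstyname") == "URL":
            # The character style carries the semantics: a URL becomes a link.
            url = content.strip()
            link = ET.SubElement(item, "a",
                                 {"xml:link": "simple", "href": url})
            link.text = url
        else:
            # Formatting-only attributes (color, fontsize) are dropped.
            ET.SubElement(item, "text").text = content
    return item


item = bullet_to_item(ET.fromstring(SAMPLE))
serialised = ET.tostring(item, encoding="unicode")
# Re-parsing the result is a cheap well-formedness check.
ET.fromstring(serialised)
print(serialised)
```

Validity against a DTD is a separate matter; this sketch only guards well-formedness.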
I tried using the 'HTML' that Word 'emits' and had to have a lie
down... This scheme of using RTF and well-marked-up original documents
seems to be helping us along in our up-conversion process (whoever
chose that term knew what they were talking about: it's like
climbing, or rather inching, up a vertical cliff face backwards
with no ropes... great fun).
hth
tone
------
Dr Tony McDonald, FMCC, Networked Learning Environments Project
The Medical School, Newcastle University Tel: +44 191 222 5888
Fingerprint: 3450 876D FA41 B926 D3DD F8C3 F2D0 C3B9 8B38 18A2
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)